This project focuses on recognizing human activities using machine learning. The dataset contains sensor data from 15 users performing a variety of activities. The data is segmented and features are extracted with a sliding-window approach, and multiple models, including logistic regression, decision trees, random forests, and neural networks, are trained and evaluated. The results indicate that the Random Forest, XGBoost, and CNN models perform best at recognizing the activities.
The dataset contains activity-tracking data collected from 15 users (User IDs 13-27), each performing 15 different activities. The activities include actions such as sitting, standing, walking, running, using a computer, and object-manipulation tasks like picking up items from the floor. Each activity is labeled and categorized, with transition states included for specific scenarios (e.g., stationary to wearing a device). Some sensor readings are missing, and all sensors are sampled at 100 Hz.
The activities are grouped into four categories: sitting (reading, writing, typing, browsing, moving head/body, moving chair), standing (static standing, picking up items from the floor), dynamic (walking, running, taking stairs), and transitions (stationary to wearing a device, putting it back, and stationary with a 15-second delay). Each activity is documented with its corresponding experiment numbers and labeled files. Some entries contain missing sensor data, such as accelerometer or gyroscope readings.
To better understand the dataset, we provide visualizations of the sensor data collected from the accelerometer, gyroscope, and magnetometer. These figures are located in the figures folder.
The feature extraction process segments the accelerometer and gyroscope data with a sliding window and computes statistical features for activity recognition. The data is segmented with configurable window sizes (e.g., 100, 200, or 300 samples) and overlap percentages (e.g., 25% or 50%); the step size is `window_size * (1 - overlap)`. For each window, the mean, median, variance, and standard deviation are computed for each axis (x, y, z) of the accelerometer and gyroscope. The sensor data is segmented per experiment ID, the features are calculated for each window, and the user IDs are appended. Missing data is handled, and the results are aggregated into a DataFrame. The feature files are saved as `features/w{window_size}_o{overlap_percentage}_features.csv` for each configuration.
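A minimal sketch of this segmentation step is shown below; the raw-data path, column names (`acc_x` … `gyro_z`, `label`, `user_id`, `exp_id`), and the helper function are illustrative assumptions, not the exact code from the notebook.

```python
import pandas as pd

SENSOR_COLS = ["acc_x", "acc_y", "acc_z", "gyro_x", "gyro_y", "gyro_z"]  # assumed column names

def window_features(df, window_size=200, overlap=0.5):
    """Segment one experiment's sensor data and compute per-window statistics."""
    step = int(window_size * (1 - overlap))              # e.g. 200 * (1 - 0.5) = 100 samples
    rows = []
    for start in range(0, len(df) - window_size + 1, step):
        window = df.iloc[start:start + window_size]
        feats = {}
        for col in SENSOR_COLS:
            values = window[col].dropna()                # skip missing sensor readings
            feats[f"{col}_mean"] = values.mean()
            feats[f"{col}_median"] = values.median()
            feats[f"{col}_var"] = values.var()
            feats[f"{col}_std"] = values.std()
        feats["label"] = window["label"].mode().iloc[0]  # majority activity label in the window
        feats["user_id"] = window["user_id"].iloc[0]
        rows.append(feats)
    return pd.DataFrame(rows)

# `raw` is the full 100 Hz sensor DataFrame; the path is hypothetical
raw = pd.read_csv("data/raw_sensor_data.csv")
window_size, overlap = 200, 0.5
features = pd.concat(
    [window_features(g, window_size, overlap) for _, g in raw.groupby("exp_id")],
    ignore_index=True,
)
features.to_csv(f"features/w{window_size}_o{int(overlap * 100)}_features.csv", index=False)
```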
The accompanying Jupyter Notebook uses the following models and techniques (a condensed training-and-evaluation sketch follows the list):
- Logistic Regression - A linear classifier for baseline comparisons.
- Naive Bayes (GaussianNB) - Fast probabilistic model assuming feature independence.
- Support Vector Machines (SVC) - Effective for non-linear and high-dimensional data.
- Decision Tree Classifier - Interpretable tree-based model for classification.
- K-Nearest Neighbors (KNN) - Distance-based classifier for simple patterns.
- Random Forest Classifier - Ensemble method reducing overfitting.
- AdaBoost Classifier - Boosting weak learners for improved performance.
- Gradient Boosting Classifier - Sequentially reduces residual errors.
- XGBoost Classifier - High-performance gradient boosting implementation.
- Neural Networks (Keras):
  - Dense Layers - Fully connected layers for classification tasks.
  - CNN (Conv1D) - Captures spatial patterns in sequential data.
  - LSTM - Captures temporal dependencies in sequential data.
  - SimpleRNN - Handles simpler sequential patterns.
- Standard Scaler - Normalizes data for consistent scale.
- Principal Component Analysis (PCA) - Reduces dimensionality for efficiency.
- Cross-Validation (KFold, StratifiedKFold) - Ensures robust model evaluation.
- GridSearchCV and RandomizedSearchCV - Optimizes hyperparameters.
- TSFresh - Automates feature extraction from time-series data.
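For reference, a condensed sketch of how such a model comparison can be wired up with scikit-learn and XGBoost is shown below; the feature file, column names, and hyperparameters are illustrative assumptions rather than the notebook's exact settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier

# Assumed feature file and column names from the extraction step above
df = pd.read_csv("features/w500_o50_features.csv")
X = df.drop(columns=["label", "user_id"])
y = LabelEncoder().fit_transform(df["label"])    # integer-encode activity labels (needed by XGBoost)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boost": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="mlogloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, pred):.2f}, "
          f"f1={f1_score(y_test, pred, average='weighted'):.2f}")
```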
The results demonstrate that ensemble models such as Random Forest, XGBoost, and Gradient Boost consistently achieved the best performance, with accuracies reaching up to 68% and F1-scores of 0.68 in configurations with larger window sizes (e.g., 500) and higher overlaps (e.g., 50%). Among neural networks, Artificial Neural Networks (ANN) performed competitively, achieving up to 60% accuracy, but were generally outperformed by ensemble models. Larger window sizes and overlaps improved feature representation, enhancing model accuracy and robustness. Ensemble methods emerged as the most reliable classifiers for activity recognition, effectively capturing complex patterns in the data.
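The K-Fold Score and Stratified Score columns in the tables below come from cross-validation. A minimal sketch of how such scores can be computed, reusing `X` and `y` from the sketch above (the fold count and estimator are illustrative assumptions):

```python
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, random_state=42)

# Plain K-Fold ignores class proportions per fold; StratifiedKFold preserves them.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("K-Fold:", cross_val_score(clf, X, y, cv=kfold).mean())
print("Stratified:", cross_val_score(clf, X, y, cv=strat).mean())
```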
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.30 | 0.30 | 0.30 | 0.28 | 0.25 | 0.26 |
| Decision Trees | 0.51 | 0.51 | 0.51 | 0.51 | 0.601 | 0.603 |
| Random Forest | 0.64 | 0.64 | 0.64 | 0.64 | 0.72 | 0.72 |
| Gaussian Naïve Bayes | 0.22 | 0.23 | 0.22 | 0.19 | 0.13 | 0.13 |
| Support Vector Classifier | 0.56 | 0.55 | 0.56 | 0.55 | 0.275 | 0.274 |
| KNN | 0.52 | 0.54 | 0.52 | 0.51 | 0.36 | 0.365 |
| AdaBoost | 0.32 | 0.32 | 0.32 | 0.26 | 0.324 | 0.323 |
| XGBoost | 0.64 | 0.64 | 0.64 | 0.64 | 0.719 | 0.715 |
| Gradient Boost | 0.60 | 0.61 | 0.60 | 0.60 | 0.65 | 0.648 |
| Artificial Neural Networks (ANN) | 0.57 | 0.57 | 0.57 | 0.57 | 0.464 | 0.487 |
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.29 | 0.29 | 0.29 | 0.26 | 0.25 | 0.24 |
| Decision Trees | 0.53 | 0.54 | 0.53 | 0.54 | 0.62 | 0.63 |
| Random Forest | 0.68 | 0.69 | 0.68 | 0.68 | 0.76 | 0.77 |
| Gaussian Naïve Bayes | 0.26 | 0.26 | 0.26 | 0.22 | 0.13 | 0.13 |
| Support Vector Classifier | 0.58 | 0.58 | 0.58 | 0.57 | 0.29 | 0.29 |
| KNN | 0.55 | 0.56 | 0.55 | 0.53 | 0.37 | 0.37 |
| AdaBoost | 0.35 | 0.41 | 0.35 | 0.30 | 0.34 | 0.33 |
| XGBoost | 0.67 | 0.63 | 0.63 | 0.62 | 0.66 | 0.66 |
| Gradient Boost | 0.63 | 0.68 | 0.67 | 0.67 | 0.73 | 0.73 |
| Artificial Neural Networks (ANN) | 0.60 | 0.59 | 0.60 | 0.59 | 0.52 | 0.53 |
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.31 | 0.31 | 0.31 | 0.29 | 0.174 | 0.174 |
| Decision Trees | 0.63 | 0.63 | 0.63 | 0.62 | 0.72 | 0.72 |
| Random Forest | 0.47 | 0.48 | 0.47 | 0.47 | 0.591 | 0.593 |
| Gaussian Naïve Bayes | 0.25 | 0.26 | 0.25 | 0.22 | 0.26 | 0.259 |
| Support Vector Classifier | 0.55 | 0.55 | 0.55 | 0.55 | 0.25 | 0.255 |
| KNN | 0.49 | 0.53 | 0.49 | 0.48 | 0.349 | 0.349 |
| AdaBoost | 0.34 | 0.40 | 0.34 | 0.29 | 0.325 | 0.328 |
| XGBoost | 0.64 | 0.63 | 0.64 | 0.63 | 0.72 | 0.72 |
| Gradient Boost | 0.61 | 0.60 | 0.61 | 0.60 | 0.662 | 0.663 |
| Artificial Neural Networks (ANN) | 0.55 | 0.55 | 0.55 | 0.55 | 0.396 | 0.406 |
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.32 | 0.33 | 0.32 | 0.30 | 0.179 | 0.180 |
| Decision Trees | 0.54 | 0.53 | 0.54 | 0.53 | 0.631 | 0.633 |
| Random Forest | 0.68 | 0.68 | 0.68 | 0.67 | 0.766 | 0.768 |
| Gaussian Naïve Bayes | 0.24 | 0.27 | 0.24 | 0.21 | 0.153 | 0.152 |
| Support Vector Classifier | 0.59 | 0.59 | 0.59 | 0.58 | 0.277 | 0.277 |
| KNN | 0.53 | 0.55 | 0.53 | 0.51 | 0.369 | 0.371 |
| AdaBoost | 0.37 | 0.40 | 0.37 | 0.32 | 0.351 | 0.353 |
| XGBoost | 0.67 | 0.67 | 0.67 | 0.66 | 0.747 | 0.747 |
| Gradient Boost | 0.63 | 0.63 | 0.63 | 0.62 | 0.685 | 0.681 |
| Artificial Neural Networks (ANN) | 0.60 | 0.59 | 0.60 | 0.60 | 0.445 | 0.440 |
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.32 | 0.32 | 0.32 | 0.30 | 0.139 | 0.137 |
| Decision Trees | 0.47 | 0.46 | 0.47 | 0.46 | 0.585 | 0.580 |
| Random Forest | 0.62 | 0.62 | 0.62 | 0.61 | 0.724 | 0.726 |
| Gaussian Naïve Bayes | 0.30 | 0.31 | 0.30 | 0.27 | 0.149 | 0.151 |
| Support Vector Classifier | 0.54 | 0.54 | 0.54 | 0.53 | 0.236 | 0.236 |
| KNN | 0.48 | 0.52 | 0.48 | 0.46 | 0.345 | 0.344 |
| AdaBoost | 0.39 | 0.40 | 0.39 | 0.36 | 0.325 | 0.328 |
| XGBoost | 0.63 | 0.63 | 0.63 | 0.62 | 0.726 | 0.724 |
| Gradient Boost | 0.58 | 0.58 | 0.58 | 0.57 | 0.670 | 0.671 |
| Artificial Neural Networks (ANN) | 0.56 | 0.55 | 0.56 | 0.55 | 0.370 | 0.376 |
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.33 | 0.34 | 0.33 | 0.31 | 0.318 | 0.316 |
| Decision Trees | 0.50 | 0.50 | 0.50 | 0.50 | 0.635 | 0.637 |
| Random Forest | 0.66 | 0.66 | 0.66 | 0.66 | 0.771 | 0.770 |
| Gaussian Naïve Bayes | 0.28 | 0.31 | 0.28 | 0.25 | 0.275 | 0.275 |
| Support Vector Classifier | 0.58 | 0.57 | 0.58 | 0.57 | 0.256 | 0.255 |
| KNN | 0.51 | 0.54 | 0.51 | 0.49 | 0.372 | 0.372 |
| AdaBoost | 0.35 | 0.40 | 0.35 | 0.34 | 0.344 | 0.334 |
| XGBoost | 0.66 | 0.66 | 0.66 | 0.66 | 0.758 | 0.761 |
| Gradient Boost | 0.62 | 0.61 | 0.62 | 0.61 | 0.697 | 0.693 |
| Artificial Neural Networks (ANN) | 0.60 | 0.59 | 0.60 | 0.59 | 0.422 | 0.423 |
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.35 | 0.36 | 0.35 | 0.33 | 0.334 | 0.332 |
| Decision Trees | 0.46 | 0.46 | 0.46 | 0.46 | 0.582 | 0.580 |
| Random Forest | 0.61 | 0.61 | 0.61 | 0.60 | 0.725 | 0.723 |
| Gaussian Naïve Bayes | 0.30 | 0.33 | 0.30 | 0.28 | 0.289 | 0.289 |
| Support Vector Classifier | 0.54 | 0.54 | 0.54 | 0.53 | 0.222 | 0.223 |
| KNN | 0.47 | 0.51 | 0.47 | 0.45 | 0.349 | 0.351 |
| AdaBoost | 0.38 | 0.42 | 0.38 | 0.38 | 0.344 | 0.361 |
| XGBoost | 0.62 | 0.61 | 0.62 | 0.61 | 0.738 | 0.735 |
| Gradient Boost | 0.58 | 0.58 | 0.58 | 0.58 | 0.676 | 0.672 |
| Artificial Neural Networks (ANN) | 0.56 | 0.56 | 0.56 | 0.56 | 0.350 | 0.345 |
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.35 | 0.36 | 0.35 | 0.33 | 0.338 | 0.335 |
| Decision Trees | 0.51 | 0.51 | 0.51 | 0.51 | 0.628 | 0.630 |
| Random Forest | 0.66 | 0.65 | 0.66 | 0.65 | 0.776 | 0.775 |
| Gaussian Naïve Bayes | 0.31 | 0.34 | 0.31 | 0.28 | 0.291 | 0.290 |
| Support Vector Classifier | 0.59 | 0.59 | 0.59 | 0.59 | 0.253 | 0.252 |
| KNN | 0.49 | 0.53 | 0.49 | 0.48 | 0.376 | 0.379 |
| AdaBoost | 0.38 | 0.39 | 0.38 | 0.37 | 0.348 | 0.343 |
| XGBoost | 0.67 | 0.66 | 0.67 | 0.66 | 0.771 | 0.767 |
| Gradient Boost | 0.63 | 0.62 | 0.63 | 0.62 | 0.709 | 0.709 |
| Artificial Neural Networks (ANN) | 0.59 | 0.59 | 0.59 | 0.59 | 0.412 | 0.393 |
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.36 | 0.37 | 0.36 | 0.35 | 0.202 | 0.207 |
| Decision Trees | 0.46 | 0.46 | 0.46 | 0.45 | 0.590 | 0.589 |
| Random Forest | 0.60 | 0.60 | 0.60 | 0.60 | 0.732 | 0.733 |
| Gaussian Naïve Bayes | 0.29 | 0.32 | 0.29 | 0.27 | 0.172 | 0.173 |
| Support Vector Classifier | 0.55 | 0.55 | 0.55 | 0.54 | 0.218 | 0.214 |
| KNN | 0.45 | 0.50 | 0.45 | 0.43 | 0.353 | 0.356 |
| AdaBoost | 0.35 | 0.37 | 0.35 | 0.30 | 0.320 | 0.317 |
| XGBoost | 0.61 | 0.60 | 0.61 | 0.60 | 0.740 | 0.745 |
| Gradient Boost | 0.57 | 0.57 | 0.57 | 0.57 | 0.687 | 0.688 |
| Artificial Neural Networks (ANN) | 0.57 | 0.57 | 0.57 | 0.57 | 0.341 | 0.348 |
| Model | Accuracy | Precision | Recall | F1 Score | K-Fold Score | Stratified Score |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.37 | 0.37 | 0.37 | 0.36 | 0.203 | 0.206 |
| Decision Trees | 0.51 | 0.51 | 0.51 | 0.51 | 0.630 | 0.630 |
| Random Forest | 0.66 | 0.66 | 0.66 | 0.66 | 0.780 | 0.780 |
| Gaussian Naïve Bayes | 0.31 | 0.34 | 0.31 | 0.29 | 0.310 | 0.308 |
| Support Vector Classifier | 0.58 | 0.58 | 0.58 | 0.57 | 0.241 | 0.240 |
| KNN | 0.49 | 0.53 | 0.49 | 0.48 | 0.373 | 0.378 |
| AdaBoost | 0.37 | 0.40 | 0.37 | 0.36 | 0.376 | 0.376 |
| XGBoost | 0.67 | 0.67 | 0.67 | 0.67 | 0.779 | 0.780 |
| Gradient Boost | 0.62 | 0.62 | 0.62 | 0.62 | 0.718 | 0.716 |
| Artificial Neural Networks (ANN) | 0.56 | 0.56 | 0.56 | 0.56 | 0.368 | 0.343 |
For window size 500 with 25% overlap, the neural network models achieved the following results:
- LSTM: 53.25% accuracy
- CNN: 65.78% accuracy
- RNN: 49.43% accuracy
- TSFresh extracted 2,351 features for each window configuration.
- The number of rows generated:
  - Window 400, Overlap 25%: 19,146 rows
  - Window 400, Overlap 50%: 28,654 rows
  - Window 500, Overlap 25%: 15,287 rows
  - Window 500, Overlap 50%: 22,789 rows
- PCA retained 95% of the variance with 5 components (see the sketch after this list).
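A hedged sketch of the TSFresh extraction and PCA reduction steps is shown below; the input path, long-format column names, and parameters are assumptions rather than the notebook's exact code.

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Assumed long-format input: one row per 100 Hz sample, with a window identifier
# and a time index, e.g. columns: window_id, time, acc_x, ..., gyro_z (path is hypothetical)
windows = pd.read_csv("features/windows_long_format.csv")

ts_features = extract_features(windows, column_id="window_id", column_sort="time")
impute(ts_features)                        # replace NaN/inf produced by some feature calculators

# Keep enough principal components to explain 95% of the variance
X_scaled = StandardScaler().fit_transform(ts_features)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape[1], "components retained")
```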
Random Forest, XGBoost, and CNN emerged as the most effective models, with the CNN achieving the highest accuracy among the sequential (deep learning) models.
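For reference, a minimal Keras Conv1D classifier along these lines could look as follows; the input shape, layer sizes, and training settings are illustrative assumptions, not the architecture used in the notebook.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed: 500-sample windows, 6 sensor axes (acc + gyro), 15 activity classes
n_timesteps, n_channels, n_classes = 500, 6, 15

model = keras.Sequential([
    layers.Input(shape=(n_timesteps, n_channels)),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.2, epochs=30, batch_size=64)
```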


