Linear Regression Model for Predicting Pharmaceutical Sector Performance on the New York Stock Exchange
This project develops a linear regression model to predict pharmaceutical sector performance using economic, market, and industry-specific indicators.
git clone https://github.com/wusinyee/NYSE-Pharma-Performance-LR-Model.git
cd NYSE-Pharma-Performance-LR-Model
pip install -r requirements.txt
NYSE-Pharma-Performance-LR-Model/
│
├── data/
│   ├── raw/
│   │   └── .gitkeep
│   └── processed/
│       └── .gitkeep
│
├── notebooks/
│   ├── 1.0-data-preprocessing.ipynb
│   ├── 2.0-exploratory-data-analysis.ipynb
│   └── 3.0-model-development.ipynb
│
├── src/
│   ├── data/
│   │   ├── __init__.py
│   │   └── preprocess.py
│   ├── features/
│   │   ├── __init__.py
│   │   └── build_features.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── train_model.py
│   │   └── predict_model.py
│   └── visualization/
│       ├── __init__.py
│       └── visualize.py
│
├── tests/
│   ├── __init__.py
│   ├── test_data.py
│   ├── test_features.py
│   └── test_models.py
│
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
└── setup.py
This structure follows best practices for organizing a data science project:
- data/: Stores raw and processed data files.
- notebooks/: Contains Jupyter notebooks for exploration and analysis.
- src/: Houses the main source code of the project.
- tests/: Includes unit tests for different components.
- Root directory: Files for project setup and documentation.
1. Data Collection and Preparation
   a. Stock Data Collection
      - NYSE historical dataset from Kaggle
      - S&P 500 index data
      - API-fetched pharmaceutical company data
   b. Economic Data Collection
   c. Healthcare Data Collection
   d. Market Sentiment Data Collection
   e. Data Preprocessing
   f. Data Integration
   g. Data Quality Checks
   h. Feature Engineering
   i. Data Documentation
2. Exploratory Data Analysis (EDA)
   a. Analyze variable distributions
   b. Investigate correlations
   c. Examine time series characteristics
   d. Visualize key relationships
3. Feature Selection
   a. Statistical methods (correlation, VIF, mutual information)
   b. Domain knowledge application
4. Model Development
   a. Data splitting (train, validation, test)
   b. Baseline model implementation
   c. Advanced model development
      - Linear models (Ridge, Lasso)
      - Tree-based models (Random Forest, Gradient Boosting)
      - Support Vector Regression
      - Neural Networks
   d. Cross-validation
5. Model Optimization
   a. Hyperparameter tuning
   b. Ensemble methods exploration
6. Model Evaluation and Selection
   a. Performance metric comparison
   b. Model interpretability assessment
   c. Final model selection
7. Model Interpretation
   a. Feature importance analysis
   b. SHAP value analysis
8. Model Validation
   a. Test set evaluation
   b. Backtesting
   c. Sensitivity analysis
9. Deployment Planning
   a. Deployment system design
   b. Infrastructure setup
   c. Prediction pipeline development
10. Documentation and Reporting
    a. Technical documentation
    b. Final report preparation
    c. Visualization creation
11. Stakeholder Presentation
    a. Presentation preparation
    b. Key findings and results communication
12. Model Deployment
    a. Implementation of deployment system
    b. Testing and quality assurance
13. Monitoring and Maintenance
    a. Performance monitoring setup
    b. Retraining schedule establishment
    c. Version control implementation
14. Compliance and Ethics
    a. Regulatory compliance review
    b. Fairness and bias assessment
    c. Ethical use guidelines development
15. Knowledge Transfer
    a. User guide creation
    b. Training session conduction
    c. Support system setup
16. Impact Assessment
    a. Model impact measurement
    b. Efficiency gains quantification
    c. Stakeholder feedback collection
17. Iterative Improvement
    a. Regular performance reviews
    b. Continuous improvement implementation
18. Scaling and Expansion
    a. Scalability assessment
    b. Expansion roadmap development
19. Project Closure
    a. Comprehensive project review
    b. Lessons learned documentation
    c. Formal project closure
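Two of the workflow steps above, VIF-based feature screening and the train/validation/test split, can be sketched in a few lines. This is only an illustrative sketch using NumPy, not the code in `src/`: the function names `variance_inflation_factors` and `chronological_split`, the 70/15/15 split fractions, and the choice of a time-ordered (unshuffled) split are all assumptions made for this example.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (shape: n_samples x n_features).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix: intercept plus the remaining feature columns.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return np.array(vifs)


def chronological_split(n_rows, train_frac=0.7, val_frac=0.15):
    """Return (train, validation, test) index arrays that preserve time order.

    Rows are assumed to be sorted by date; no shuffling is applied, so the
    test set always lies after the training and validation sets.
    """
    train_end = round(n_rows * train_frac)
    val_end = train_end + round(n_rows * val_frac)
    idx = np.arange(n_rows)
    return idx[:train_end], idx[train_end:val_end], idx[val_end:]


# Synthetic demonstration data (not project data): x3 nearly duplicates x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)
features = np.column_stack([x1, x2, x3])

vifs = variance_inflation_factors(features)
train_idx, val_idx, test_idx = chronological_split(len(features))
print("VIFs:", vifs)
print("split sizes:", len(train_idx), len(val_idx), len(test_idx))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity; the exact threshold is a judgment call. A time-ordered split is used here because shuffling daily market data can leak future information into the training set.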
- Run data preprocessing:
python src/data/preprocess.py
- Perform EDA:
jupyter notebook notebooks/2.0-exploratory-data-analysis.ipynb
- Train the model:
python src/models/train_model.py
- Make predictions:
python src/models/predict_model.py
- Data sources: NYSE, FDA, U.S. Bureau of Economic Analysis
- Features: stock prices, economic indicators, FDA approvals
- Target variable: Pharmaceutical sector daily returns
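The target variable can be illustrated with a small sketch of how daily returns might be derived from closing prices. The README does not say how the sector return is aggregated from individual stocks, so the equal-weighted average below is an assumption, as are the function names `daily_returns` and `sector_returns`.

```python
import numpy as np


def daily_returns(close_prices):
    """Simple daily returns r_t = P_t / P_{t-1} - 1 for a 1-D price series."""
    prices = np.asarray(close_prices, dtype=float)
    if prices.ndim != 1 or prices.size < 2:
        raise ValueError("need a 1-D series with at least two prices")
    return prices[1:] / prices[:-1] - 1.0


def sector_returns(price_matrix):
    """Equal-weighted sector return: mean of per-stock daily returns.

    price_matrix has shape (n_days, n_stocks). Equal weighting is an
    assumption of this sketch; a cap-weighted index would differ.
    """
    prices = np.asarray(price_matrix, dtype=float)
    per_stock = prices[1:] / prices[:-1] - 1.0
    return per_stock.mean(axis=1)


# Made-up prices for illustration only.
single = daily_returns([100.0, 110.0, 99.0])
sector = sector_returns([[100.0, 200.0], [110.0, 180.0]])
print(single, sector)
```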
- Algorithm: Linear Regression
- Key features: [List top 5 features]
- Performance metrics: R-squared, MAE, RMSE
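As a rough illustration of the algorithm and metrics listed above, the sketch below fits a scikit-learn `LinearRegression` and reports R-squared, MAE, and RMSE. It runs on synthetic data, not the project dataset, and the helper name `fit_and_score` and the 240/60 time-ordered split are assumptions for this example rather than what `train_model.py` does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares and return the model with test metrics."""
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = {
        "r2": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
        # RMSE is the square root of the mean squared error.
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
    }
    return model, metrics


# Synthetic demonstration data: three features with known coefficients.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
true_coef = np.array([0.5, -0.2, 0.1])
y = X @ true_coef + rng.normal(scale=0.05, size=300)

# Hold out the last 60 rows, mimicking a time-ordered test set.
split = 240
model, metrics = fit_and_score(X[:split], y[:split], X[split:], y[split:])
print(metrics)
print("coefficients:", model.coef_)
```

RMSE penalizes large errors more heavily than MAE, so comparing the two gives a quick sense of whether a few days with large misses dominate the error.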
[Brief summary of model performance and key insights]
This project is licensed under the MIT License - see the LICENSE file for details.
Mandy Wu
Project Link: https://github.com/wusinyee/NYSE-Pharma-Performance-LR-Model