Introduction
This project automates the collection of job listings from the Haitian employment portal JobPaw. By crawling JobPaw’s listings automatically, the repository enables offline analysis of available job opportunities and supports downstream natural language processing and data-science tasks.
Core Objectives
- Collect job links: crawl JobPaw’s listings and extract the unique URL of each job posting.
- Retrieve job details: with the gathered links, fetch each posting’s full description and store it in a structured format for analysis.
Technologies
- Python: a powerful, easy-to-read programming language widely used in web development, data science, and automation.
- Jupyter Notebook: an open-source web application for writing and running code, visualizing data, and documenting a workflow, all in one place.
Project Structure
```
├── getLinksScript.py        # Scrapes all job posting URLs from JobPaw
├── getJobDetailsScript.py   # Visits each URL and captures job info
├── getLinks-nb.ipynb        # Notebook version of the link scraper
├── getJobDetails-nb.ipynb   # Notebook version of the details scraper
├── dataProcessing-nb.ipynb  # Prototype workflow for analyzing the results
├── jobPawLinks.xlsx         # Output of the link scraping step
├── jobDetails.xlsx          # Output of the job detail extraction step
└── docs/                    # Additional documentation or helper scripts
```
How It Works
- Link Scraper (`getLinksScript.py`)
  - Requests the main listing pages.
  - Extracts the URL for each job post.
  - Saves the aggregated links into an Excel file (`jobPawLinks.xlsx`). A hedged sketch of this step appears after the list.
- Details Scraper (`getJobDetailsScript.py`)
  - Reads the previously generated Excel file.
  - For each link, requests the job description page.
  - Parses the content (`title`, `company`, `description`, etc.) and saves it to `jobDetails.xlsx` (see the second sketch below).
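A minimal sketch of the link-scraping step, using the `requests`, `beautifulsoup4`, and `pandas` libraries mentioned below. The listing URL, pagination parameter, and CSS selector are illustrative assumptions; the real values live in `getLinksScript.py` and depend on JobPaw’s markup.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = "https://www.jobpaw.com/annonces"  # hypothetical listing page

def collect_links(pages=5):
    links = []
    for page in range(1, pages + 1):
        resp = requests.get(BASE_URL, params={"page": page}, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumption: each posting is an <a> tag whose href mentions "job".
        for a in soup.select("a[href*='job']"):
            links.append(a["href"])
    return sorted(set(links))  # deduplicate while keeping a stable order

if __name__ == "__main__":
    pd.DataFrame({"link": collect_links()}).to_excel("jobPawLinks.xlsx", index=False)
```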
Both scripts use libraries such as `requests` for HTTP requests, `beautifulsoup4` for HTML parsing, and `pandas` with `openpyxl` for structuring and writing the Excel files.
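The details step can be sketched in the same hedged way: it reads the links workbook, fetches each page, and writes `jobDetails.xlsx`. The column name and field selectors here are placeholders; the actual parsing logic is defined in `getJobDetailsScript.py`.

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

links = pd.read_excel("jobPawLinks.xlsx")["link"]  # column name is an assumption

rows = []
for url in links:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title_tag = soup.select_one("h1")  # placeholder selector
    rows.append({
        "url": url,
        "title": title_tag.get_text(strip=True) if title_tag else None,
        # "company" and a cleaner "description" need site-specific selectors;
        # the full page text is a crude stand-in.
        "description": soup.get_text(" ", strip=True),
    })
    time.sleep(1)  # throttle requests between pages

pd.DataFrame(rows).to_excel("jobDetails.xlsx", index=False)
```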
Usage Example
```bash
# Clone the repository
git clone https://github.com/htsull/Jobs_webScrapping.git
cd Jobs_webScrapping

# Install dependencies
pip install -r requirements.txt

# Step 1: Fetch job posting links
python getLinksScript.py

# Step 2: Retrieve details for each link
python getJobDetailsScript.py
```
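The repository’s `requirements.txt` is not reproduced here, but based on the libraries named above it would contain at least:

```
requests
beautifulsoup4
pandas
openpyxl
```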
Results and Further Work
The resulting Excel sheets enable deeper insights, such as filtering positions, analyzing trends, or exploring descriptions for keywords (a brief example follows the list below). Future enhancements might include:
- Automating analysis pipelines for new data,
- Adding scheduling for periodic scraping,
- Supporting additional job portals.
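As a starting point for such analysis, here is a small `pandas` sketch over `jobDetails.xlsx`. The column names are assumptions based on the fields listed earlier (`title`, `company`, `description`).

```python
import pandas as pd

jobs = pd.read_excel("jobDetails.xlsx")

# Filter positions whose description mentions a keyword.
keyword = "python"
mask = jobs["description"].str.contains(keyword, case=False, na=False)
print(f"{mask.sum()} postings mention {keyword!r}")

# Rough trend: which companies post most often.
print(jobs["company"].value_counts().head(10))
```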
For more details, visit the GitHub repository.
Legal & Ethical Considerations
Scraping should always respect the website’s terms of service and local regulations. Excessive automated access can lead to IP blocking or legal issues. This project is provided for educational purposes; use responsibly.
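One way to keep the scraping polite, sketched under the assumption that JobPaw publishes a standard `robots.txt`, is to check permissions with Python’s built-in `urllib.robotparser` and to pause between requests:

```python
import time
import urllib.robotparser

import requests

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.jobpaw.com/robots.txt")
rp.read()

url = "https://www.jobpaw.com/annonces"  # hypothetical listing page
if rp.can_fetch("*", url):
    resp = requests.get(url, timeout=10)
    time.sleep(2)  # throttle: at most one request every couple of seconds
else:
    print("Fetching disallowed by robots.txt; skipping.")
```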