Introduction
This project automates the collection of job listings from the Haitian employment portal JobPaw. By crawling JobPaw’s listings automatically, the repository enables offline analysis of available job opportunities and supports downstream natural language processing and data-science tasks.
Core Objectives
- Collect job links: crawl JobPaw’s listings and extract the unique URL of each job posting.
- Retrieve job details: with the gathered links, fetch each posting’s full description and store it in a structured format for analysis.
Technologies
- Python: a powerful, easy-to-read programming language widely used in web development, data science, and automation.
- Jupyter Notebook: an open-source web application for writing and running code, visualizing data, and documenting a workflow, all in one place.
Project Structure
```
├── getLinksScript.py        # Scrapes all job posting URLs from JobPaw
├── getJobDetailsScript.py   # Visits each URL and captures job info
├── getLinks-nb.ipynb        # Notebook version of the link scraper
├── getJobDetails-nb.ipynb   # Notebook version of the details scraper
├── dataProcessing-nb.ipynb  # Prototype workflow for analyzing the results
├── jobPawLinks.xlsx         # Output of the link scraping step
├── jobDetails.xlsx          # Output of the job detail extraction step
└── docs/                    # Additional documentation or helper scripts
```
How It Works
- Link Scraper (`getLinksScript.py`)
  - Requests the main listing pages.
  - Extracts the URL for each job post.
  - Saves the aggregated links into an Excel file (`jobPawLinks.xlsx`). A hedged sketch of this step appears after the list.
- Details Scraper (`getJobDetailsScript.py`)
  - Reads the previously generated Excel file.
  - For each link, requests the job description page.
  - Parses the content (`title`, `company`, `description`, etc.) and saves it to `jobDetails.xlsx` (see the second sketch below).
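A minimal sketch of the link-scraping step, using the `requests`, `beautifulsoup4`, and `pandas` libraries mentioned below. The listing URL, pagination parameter, and CSS selector are illustrative assumptions; the real values live in `getLinksScript.py` and depend on JobPaw’s markup.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = "https://www.jobpaw.com/annonces"  # hypothetical listing page

def collect_links(pages=5):
    links = []
    for page in range(1, pages + 1):
        resp = requests.get(BASE_URL, params={"page": page}, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumption: each posting is an <a> tag whose href mentions "job".
        for a in soup.select("a[href*='job']"):
            links.append(a["href"])
    return sorted(set(links))  # deduplicate while keeping a stable order

if __name__ == "__main__":
    pd.DataFrame({"link": collect_links()}).to_excel("jobPawLinks.xlsx", index=False)
```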
Both scripts use libraries such as `requests` for HTTP requests, `beautifulsoup4` for HTML parsing, and `pandas` with `openpyxl` for structuring and writing the Excel files.
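The details step can be sketched in the same hedged way: it reads the links workbook, fetches each page, and writes `jobDetails.xlsx`. The column name and field selectors here are placeholders; the actual parsing logic is defined in `getJobDetailsScript.py`.

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

links = pd.read_excel("jobPawLinks.xlsx")["link"]  # column name is an assumption

rows = []
for url in links:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title_tag = soup.select_one("h1")  # placeholder selector
    rows.append({
        "url": url,
        "title": title_tag.get_text(strip=True) if title_tag else None,
        # "company" and a cleaner "description" need site-specific selectors;
        # the full page text is a crude stand-in.
        "description": soup.get_text(" ", strip=True),
    })
    time.sleep(1)  # throttle requests between pages

pd.DataFrame(rows).to_excel("jobDetails.xlsx", index=False)
```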
Usage Example
```bash
# Clone the repository
git clone https://github.com/htsull/Jobs_webScrapping.git
cd Jobs_webScrapping

# Install dependencies
pip install -r requirements.txt

# Step 1: Fetch job posting links
python getLinksScript.py

# Step 2: Retrieve details for each link
python getJobDetailsScript.py
```
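The repository’s `requirements.txt` is not reproduced here, but based on the libraries named above it would contain at least:

```
requests
beautifulsoup4
pandas
openpyxl
```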
Results and Further Work
The resulting Excel sheets enable deeper insights, such as filtering positions, analyzing trends, or exploring descriptions for keywords (a brief example follows the list below). Future enhancements might include:
- Automating analysis pipelines for new data,
- Adding scheduling for periodic scraping,
- Supporting additional job portals.
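As a starting point for such analysis, here is a small `pandas` sketch over `jobDetails.xlsx`. The column names are assumptions based on the fields listed earlier (`title`, `company`, `description`).

```python
import pandas as pd

jobs = pd.read_excel("jobDetails.xlsx")

# Filter positions whose description mentions a keyword.
keyword = "python"
mask = jobs["description"].str.contains(keyword, case=False, na=False)
print(f"{mask.sum()} postings mention {keyword!r}")

# Rough trend: which companies post most often.
print(jobs["company"].value_counts().head(10))
```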
For more details, visit the GitHub repository.
Legal & Ethical Considerations
Scraping should always respect the website’s terms of service and local regulations. Excessive automated access can lead to IP blocking or legal issues. This project is provided for educational purposes; use responsibly.
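One way to keep the scraping polite, sketched under the assumption that JobPaw publishes a standard `robots.txt`, is to check permissions with Python’s built-in `urllib.robotparser` and to pause between requests:

```python
import time
import urllib.robotparser

import requests

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.jobpaw.com/robots.txt")
rp.read()

url = "https://www.jobpaw.com/annonces"  # hypothetical listing page
if rp.can_fetch("*", url):
    resp = requests.get(url, timeout=10)
    time.sleep(2)  # throttle: at most one request every couple of seconds
else:
    print("Fetching disallowed by robots.txt; skipping.")
```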