This is a basic manual on how you could set up a LinkedIn scraper using Python.
Creating a LinkedIn Job Posting Scraper in Python involves using web scraping techniques to extract data from LinkedIn job listings.
The code in this project opens Chrome (via ChromeDriver), searches for job postings and retrieves the following data: job title, company, location, job ID, link to the posting and job description. It then compiles the information and saves it into a comma-separated values (CSV) file.
Keep in mind that web scraping LinkedIn is against their terms of service, so it's essential to use this information responsibly and consider using LinkedIn's official API if you need data for legitimate purposes.
I created this guide for educational use, to demonstrate how to extract job listings automatically so they can be imported into Excel, filtered and tracked more efficiently. Some terms used in this project are defined in a glossary in case you're unfamiliar with them.
You’ll need the following libraries/packages/modules: Selenium, chromedriver_autoinstaller, pandas and BeautifulSoup (bs4).
(Install instructions are in the next section; click on the name of a term to see its definition in the glossary.)
Before you begin, check if Python is already installed on your computer. Open your command prompt (Windows) or terminal (macOS and Linux) and type the following command:
python --version
If Python is installed, it will display the version number (e.g., Python 3.9.1). If it's not installed, then install Python.
I personally use Anaconda, but below you can find a couple of alternatives to download the distribution of your choice.
When choosing the version, it is best to pick the latest release unless you have a specific reason to use an earlier one. Then download the installer, run it and follow the instructions. If you are unsure about the options offered in the installer, keep it simple and accept the defaults.
Python.org
https://www.python.org/downloads/
Anaconda
https://www.anaconda.com/download
Microsoft Store Package
https://apps.microsoft.com/detail/9pjpw5ldxlz5?ocid=webpdpshare
While Python includes the IDLE development environment, many developers prefer using code editors like Visual Studio Code, PyCharm, or Jupyter Notebook (Jupyter is included with Anaconda) for a more robust coding experience. Download and install a code editor of your choice.
Mac, Windows and Linux
https://code.visualstudio.com/download
Mac
https://www.jetbrains.com/pycharm/download/?section=mac
Windows
https://www.jetbrains.com/pycharm/download/?section=windows
Mac, Windows and Linux
https://www.anaconda.com/download
I also recommend Anaconda to manage these libraries and other packages; in my opinion it’s one of the easiest ways to manage Python on your machine. However, if you want to install them via terminal/prompt, you can run the commands below (pip is required; installation instructions are at the bottom of this section).
JupyterLab is a great notebook for testing and debugging your code. It is best to create a new environment for this project to avoid conflicts with other projects you might be working on (see the conda example after the install commands below).
# Install Selenium and the ChromeDriver auto-installer
pip install selenium
pip install chromedriver_autoinstaller
# Install Pandas
pip install pandas
# Install BeautifulSoup
pip install beautifulsoup4
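If you use Anaconda, a minimal sketch of setting up a dedicated environment for this project could look like the following (the environment name linkedin-scraper and the Python version are just examples, not part of the original setup):

# Create and activate a new conda environment for the project
conda create -n linkedin-scraper python=3.11
conda activate linkedin-scraper
# Install the packages inside the environment (pip also works inside conda environments)
pip install selenium chromedriver_autoinstaller pandas beautifulsoup4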
Optional:
# PIP Installation:
# First download the bootstrap script, then run it with your Python interpreter
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
# Linux / macOS
python get-pip.py
# Windows
py get-pip.py
Note: a pound sign (#) marks a line as a comment; it is not executed and serves only as text for reference, instructions, descriptions, etc.
This script was developed and tested/debugged using JupyterLab; however, it can also be used with Visual Studio Code, PyCharm or any other notebook or editor you prefer.
The content is divided into parts, and it is recommended to test or debug each part before moving on to the next one. There are comments throughout the project that I thought might be useful for you. If you have additional comments, please shoot me an email at
# Part 1 - Import necessary packages and libraries
import time
import random
import os
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import chromedriver_autoinstaller
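If any of these imports fail, revisit the installation step. A quick, optional way to confirm the packages are available and check their versions (this snippet is only for verification and is not part of the scraper itself):

# Optional sanity check: print the installed package versions
import selenium, bs4, pandas
print(selenium.__version__, bs4.__version__, pandas.__version__)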
ChromeDriver is a separate executable or a standalone server that WebDriver (Selenium in this case) uses to launch Google Chrome. In our project, WebDriver refers to a collection of APIs used to automate the testing of web applications.
We will use it to emulate what you would do as a Chrome user, such as opening pages, scrolling and clicking, so the job data can be collected and later saved into a CSV file.
This is done with the code below (paste it below the code from Part 1 and keep appending each subsequent part in the same way):
# Part 2 - Check latest chromedriver version and install automatically if necessary
chromedriver_autoinstaller.install()

# Configure Chrome options
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")  # Launch Chrome maximized

browser = webdriver.Chrome(options=options)
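Before moving on, you can optionally confirm that ChromeDriver launched correctly. This quick check simply opens LinkedIn's home page and prints the page title (the exact title text may vary):

# Optional: verify the browser is up and responding
browser.get("https://www.linkedin.com")
print(browser.title)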
The URL used below was obtained by opening Chrome (without logging in to LinkedIn), searching for the keywords “VBA Analyst” with “Toronto” as the location, and copying the address from the address bar.
URL with the keywords embedded as query parameters (the %20 means "space", as in "VBA Analyst"):
https://www.linkedin.com/jobs/search/?keywords=VBA%20Analyst&location=Toronto&position=1&pageNum=0
You can modify keywords (job and location) as needed.
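If you prefer to build the URL from variables instead of editing it by hand, here is a small sketch using the standard library's urllib.parse.quote to handle the %20 encoding (the variable names are my own, not part of the original script):

# Build the search URL from keyword and location variables
from urllib.parse import quote

keywords = "VBA Analyst"   # spaces are encoded as %20
location = "Toronto"

search_url = ("https://www.linkedin.com/jobs/search/"
              f"?keywords={quote(keywords)}&location={quote(location)}&position=1&pageNum=0")
print(search_url)  # .../jobs/search/?keywords=VBA%20Analyst&location=Toronto&position=1&pageNum=0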
# Part 3 - Open LinkedIn job search page (modify keywords as needed)
browser.get('https://www.linkedin.com/jobs/search/?keywords=Business%20Analyst&location=Toronto&position=1&pageNum=0')

# Set the number of pages to scrape
pages: int = 10
When the results are displayed on the left side of the browser (as of July 2024), at the very bottom there is a button labelled something like “See more jobs”. Each time the button is clicked, another page of results is loaded; that is what the number of pages we scrape refers to.
# Part 4 - Loop through the specified number of pages to retrieve job postings
for i in range(pages):
    print(f'Scraping page {i + 1}')
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # Click on the "See more jobs" button if present
        element = WebDriverWait(browser, 5).until(
            EC.presence_of_element_located(
                (By.XPATH, "/html/body/div[1]/div/main/section[2]/button")
            )
        )
        element.click()
    except Exception:
        pass
In this section we’ll use BeautifulSoup to parse the HTML of the page so we can locate elements in the HTML more efficiently.
Notice the indentation in the code block below; it is very important to keep indentation consistent in the .py file.
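If BeautifulSoup is new to you, here is a tiny self-contained example of how find and find_all locate elements by tag and class. The HTML snippet is made up for illustration and only mimics the structure we expect from the LinkedIn results page:

# Minimal BeautifulSoup example: parse a snippet and extract text by class
from bs4 import BeautifulSoup

html = '<div class="base-card"><h3 class="base-search-card__title"> VBA Analyst </h3></div>'
soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="base-card")  # first element matching the tag and class
print(card.find("h3", class_="base-search-card__title").text.strip())  # VBA Analyst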
# Part 5 - Scrape job postings
jobs = []
soup = BeautifulSoup(browser.page_source, "html.parser")
job_listings = soup.find_all("div", class_="base-card")

for job in job_listings:
    job_title = job.find("h3", class_="base-search-card__title").text.strip()
    job_company = job.find("h4", class_="base-search-card__subtitle").text.strip()
    job_location = job.find("span", class_="job-search-card__location").text.strip()
    apply_link = job.find("a", class_="base-card__full-link")["href"]
    # Take the 10 characters immediately before '?position=' in the link (the job ID)
    job_ID = apply_link[apply_link.find('?position=') - 10:apply_link.find('?position=')]

    # Open the posting and wait 5-10 seconds to avoid hammering the site
    browser.get(apply_link)
    time.sleep(random.choice(list(range(5, 11))))

    try:
        description_soup = BeautifulSoup(browser.page_source, "html.parser")
        job_description = description_soup.find("div", class_="description__text description__text--rich").text.strip()
    except AttributeError:
        job_description = None

    jobs.append({
        "job ID": job_ID,
        "title": job_title,
        "company": job_company,
        "location": job_location,
        "link": apply_link,
        "job description": job_description,
    })
The scraped data is saved into a CSV file located in the same directory as the .py file.
# Part 6 - Save data into a csv file, exclude index column
df = pd.DataFrame(jobs)
df.to_csv("jobs3.csv", index=False)
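Once the CSV exists, you can load it back into pandas to filter or deduplicate the postings before importing them into Excel. A minimal sketch, assuming the column names written in Part 5 (the "Excel" keyword filter is just an example):

# Reload the CSV and keep only postings whose description mentions a keyword
import pandas as pd

df = pd.read_csv("jobs3.csv")
excel_jobs = df[df["job description"].str.contains("Excel", case=False, na=False)]
print(excel_jobs[["title", "company", "location"]])

When you are finished scraping, you can also close the browser with browser.quit().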