LinkedIn Job Posting Scraper

This is a basic manual on how you could set up a LinkedIn scraper using Python.


Creating a LinkedIn Job Posting Scraper in Python involves using web scraping techniques to extract data from LinkedIn job listings.

The code in this project opens Chrome (via ChromeDriver), searches for job postings and retrieves the data: job title, company, location, job ID, link to the posting and job description. It then compiles the information and saves it into a comma-separated values (csv) file.

Keep in mind that web scraping LinkedIn is against their terms of service, so it's essential to use this information responsibly and consider using LinkedIn's official API if you need data for legitimate purposes.

I created this guide for educational use, to demonstrate how to extract job listings automatically so they can be imported into Excel, filtered and tracked more efficiently. Some terms used in this project are defined in a glossary in case you're unfamiliar with them. 

What you will need

You’ll need the following libraries / packages / modules:
(Install instructions are in the next section; each of these terms is defined in the glossary.)

BeautifulSoup

chromedriver_autoinstaller

os (python module)

Pandas

Selenium

Install Python

I recommend Anaconda to install Python and packages. This distribution has many nice features and makes it very easy to code, debug and manage libraries/packages/modules.

Nevertheless, there are many ways to install Python as shown below.



Step 1- Check if Python is Installed

Before you begin, check if Python is already installed on your computer. Open your command prompt (Windows) or terminal (macOS and Linux) and type the following command:

    python --version

If Python is installed, it will display the version number (e.g., Python 3.9.1). If it's not, follow the next steps to install it.


Step 2 - Download Python

I personally use Anaconda, but below you can find a couple of alternatives to download the distribution of your choice.

When choosing the version, it's best to pick the latest one unless you have a specific reason to download an earlier release. Then download the installer, run it and follow the instructions. If you're unsure about the options offered in the installer, keep it simple and stick to the defaults.

Python.org
https://www.python.org/downloads/

Anaconda
https://www.anaconda.com/download

Microsoft Store Package
https://apps.microsoft.com/detail/9pjpw5ldxlz5?ocid=webpdpshare


Step 3 - Verify Python Installation

Once the installation is complete, open your command prompt or terminal again and run the same command from Step 1:

    python --version

You should see the installed Python version. If not, try re-installing or choose another distribution.


Step 4 - Install a Code Editor (Optional)

While Python includes the IDLE development environment, many developers prefer using code editors like Visual Studio Code, PyCharm, or Jupyter Notebook (Jupyter is included with Anaconda) for a more robust coding experience. Download and install a code editor of your choice.


Visual Studio Code

Mac, Windows and Linux
https://code.visualstudio.com/download


PyCharm

Mac
https://www.jetbrains.com/pycharm/download/?section=mac

Windows
https://www.jetbrains.com/pycharm/download/?section=windows 


Anaconda

Mac, Windows and Linux
https://www.anaconda.com/download

Install the Libraries


I also recommend Anaconda to manage these libraries and other packages; in my opinion it's one of the easiest ways to manage Python on your machine. However, if you want to install them via terminal/prompt, you can run the commands below (the pip package manager is required; install instructions are at the bottom of this section).

JupyterLab is a great notebook for testing and debugging your code. It's best to create a new environment for this project, to avoid conflicts with other projects you might be working on.


# Install Selenium
pip install selenium

# Install chromedriver_autoinstaller
pip install chromedriver_autoinstaller

# Install Pandas
pip install pandas

# Install BeautifulSoup
pip install beautifulsoup4


Optional:

# PIP Installation:

# First download get-pip.py from https://bootstrap.pypa.io/get-pip.py, then run:

# Linux / MacOS
python get-pip.py

# Windows
py get-pip.py
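
If you go the Anaconda route suggested above, a rough conda equivalent is to create a dedicated environment for the project and install the same packages into it (the environment name below is just an example):

# Create and activate a fresh environment for this project (the name is arbitrary)
conda create -n linkedin-scraper python=3.11
conda activate linkedin-scraper

# Install the packages used in this guide
pip install selenium chromedriver_autoinstaller pandas beautifulsoup4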

The Code

Note: A pound sign (#) marks a line as a comment; that line is not run and serves only as text for reference, instructions, descriptions, etc.

This script was developed and tested/debugged using JupyterLab; however, it can also be used with Visual Studio Code, PyCharm or any other notebook/editor you might find better for you.

The content was divided into parts; it is recommended to test or debug each part before moving to the next one. There are comments throughout the project that I thought might be useful for you. If you have additional comments, please reach out (contact details at the end of this guide).


Part 1 - Import packages and libraries

Python packages are collections of Python modules, while Python libraries are collections of functions aimed at carrying out specific tasks.

We add the following lines of code at the top of our py file (a Python file with a .py extension):

						
	# Part 1 - Import necessary packages and libraries
	import time                        # pauses between page loads
	import random                      # randomize the length of the pauses
	import os
	import pandas as pd                # build and save the results table
	from bs4 import BeautifulSoup      # parse the page HTML
	from selenium import webdriver     # drive the Chrome browser
	from selenium.webdriver.support.ui import WebDriverWait
	from selenium.webdriver.support import expected_conditions as EC
	from selenium.webdriver.common.by import By
	import chromedriver_autoinstaller  # install/update ChromeDriver automatically
						

Part 2 - Set up Chromedriver

ChromeDriver is a separate executable or a standalone server that WebDriver (Selenium in this case) uses to launch Google Chrome. In our project, WebDriver refers to a collection of APIs used to automate the testing of web applications.

We will use it to emulate what you would do as a Chrome user, such as clicking through pages and copying the job data that will end up in the csv file.

It's done by using the code below (paste it below the code from Part 1, and keep appending the following parts in the same way):

						
	# Part 2 - Check latest chromedriver version and install automatically if necessary 
	chromedriver_autoinstaller.install()

	# Configure Chrome options
	options = webdriver.ChromeOptions()
	options.add_argument("--start-maximized")

	# Launch chromedriver maximized
	browser = webdriver.Chrome(options=options)
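
If you'd rather not have a visible Chrome window pop up, the browser can also be run headless. This is an optional variation of the options above, not used in the rest of the guide, and LinkedIn may behave differently without a visible window:

	# Optional variation: run Chrome without a visible window (headless mode)
	options = webdriver.ChromeOptions()
	options.add_argument("--headless=new")
	options.add_argument("--start-maximized")
	browser = webdriver.Chrome(options=options)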
						

Part 3 - Open LinkedIn in search mode and set up how many pages to retrieve

The URL used below was obtained by opening Chrome (without logging in to LinkedIn), searching for the keyword “VBA Analyst” with “Toronto” as the location, and copying the address from the address bar.

URL with the search keywords (the %20 encodes a space, as in "VBA Analyst"):
https://www.linkedin.com/jobs/search/?keywords=VBA%20Analyst&location=Toronto&position=1&pageNum=0

You can modify keywords (job and location) as needed.

						
	# Part 3 - Open LinkedIn job search page (modify keywords as needed)
	browser.get('https://www.linkedin.com/jobs/search/?keywords=VBA%20Analyst&location=Toronto&position=1&pageNum=0')

	# Set the number of pages to scrape
	pages: int = 10
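
If you'd rather not hand-edit the URL, the keywords can also be encoded programmatically instead of using the hard-coded address above. This is just an optional sketch using Python's standard urllib, not part of the original script; the variable names are illustrative:

	# Optional: build the search URL from plain-text keywords
	from urllib.parse import quote

	job_keywords = "VBA Analyst"    # example values; change as needed
	job_location = "Toronto"

	search_url = (
		"https://www.linkedin.com/jobs/search/"
		f"?keywords={quote(job_keywords)}&location={quote(job_location)}"
		"&position=1&pageNum=0"
	)
	browser.get(search_url)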
						

Part 4 - Loop through the number of pages to retrieve postings

When the results are displayed on the left side of the browser (as of July 2024), there is a button at the very bottom labeled something like “See more jobs”. Each time the button is clicked, another page of results is loaded, so the number of clicks corresponds to the number of pages we will be scraping.


	# Part 4 - Loop through the specified number of pages to retrieve job postings
	for i in range(pages):
		print(f'Scraping page {i + 1}')
		# Scroll to the bottom so the "See more jobs" button comes into view
		browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

		try:
			# Click on the "see more jobs" button if present
			element = WebDriverWait(browser, 5).until(
				EC.presence_of_element_located(
					(By.XPATH, "/html/body/div[1]/div/main/section[2]/button")
				)
			)
			element.click()
			# Short pause so the newly loaded postings have time to render
			time.sleep(random.uniform(2, 4))
		except Exception:
			# Button not found (e.g., no more results to load); move on
			pass
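
The absolute XPath above points at one exact spot in LinkedIn's page structure and can break whenever the layout changes. A possibly more resilient variation (not part of the original script, and the exact button text is an assumption you should verify against the live page) would be to locate the button by its visible label:

	# Hypothetical alternative: match the button by its visible text instead of an absolute path
	element = WebDriverWait(browser, 5).until(
		EC.element_to_be_clickable(
			(By.XPATH, "//button[contains(., 'See more jobs')]")
		)
	)
	element.click()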
						

Part 5 - Scrape!

In this section we’ll use BeautifulSoup to parse the website's HTML so we can find elements in the HTML code more efficiently.

Notice the indentation in the code block below; it's very important to keep indentation consistent in the py file.

						
	# Part 5 - Scrape job postings
	jobs = []
	soup = BeautifulSoup(browser.page_source, "html.parser")
	job_listings = soup.find_all("div", class_="base-card")

	for job in job_listings:

		# Pull the basic fields from the search-results card
		job_title = job.find("h3", class_="base-search-card__title").text.strip()
		job_company = job.find("h4", class_="base-search-card__subtitle").text.strip()
		job_location = job.find("span", class_="job-search-card__location").text.strip()
		apply_link = job.find("a", class_="base-card__full-link")["href"]
		# The job ID is the 10 characters right before '?position=' in the link
		job_ID = apply_link[apply_link.find('?position=')-10:apply_link.find('?position=')]

		# Open the posting itself and wait a random 5-10 seconds to avoid hammering the site
		browser.get(apply_link)
		time.sleep(random.choice(list(range(5, 11))))

		try:
			# Parse the posting page and grab the full job description
			description_soup = BeautifulSoup(browser.page_source, "html.parser")
			job_description = description_soup.find("div", class_="description__text description__text--rich").text.strip()
		except AttributeError:
			# The description block wasn't found on this page
			job_description = None

		# Collect everything for this posting
		jobs.append({
			"job ID": job_ID,
			"title": job_title,
			"company": job_company,
			"location": job_location,
			"link": apply_link,
			"job description": job_description,
		})
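
Before moving on to Part 6, it can be handy to confirm that something was actually collected. This quick check is optional and not part of the original script:

	# Optional: confirm how many postings were collected and peek at the first one
	print(f"Collected {len(jobs)} job postings")
	if jobs:
		print(jobs[0])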

						

Part 6 - Save data into a csv file, exclude index column

The scraped data is saved into a csv file located in the same directory as the py file.

						
	# Part 6 - Save data into a csv file, exclude index column
	df = pd.DataFrame(jobs)
	df.to_csv("jobs3.csv", index=False)
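
Once the file is written you can close the browser session and, if you like, take a quick look at the output. These lines are optional additions rather than part of the original script:

	# Optional: close the browser and preview the saved CSV
	browser.quit()
	print(pd.read_csv("jobs3.csv").head())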
						

Contact Antonio on LinkedIn.