Scraping Odds from Oddschecker using Python

Riz Dusoye
4 min read · Sep 26, 2024


In the world of sports betting and political forecasting, having access to accurate and up-to-date odds is crucial. Oddschecker, a popular odds comparison website, aggregates betting odds from various bookmakers, making it a great resource for bettors and analysts alike. In this article, we’ll look at how to use Python to scrape betting odds from Oddschecker, focusing on political betting markets as an example.

Setting Up the Environment

Before we dive into the scraping process, let’s set up our Python environment. We’ll need several libraries to handle web scraping, data processing, and concurrent execution. Here are the key libraries we’ll be using:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import time
import concurrent.futures

Although this approach works at the time of writing, Oddschecker frequently updates its site layout to discourage scraping. Where we were once able to simply use BeautifulSoup to download the page contents, we now need Selenium to appear as human as possible to the site, and further amendments to the code will likely be required in the future. BeautifulSoup is still used to parse the HTML, while concurrent.futures lets us scrape multiple pages in parallel.

Configuring Selenium WebDriver

To use Selenium, we need to set up a WebDriver. In this case, we’re using Chrome:

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')

service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service, options=chrome_options)

These options help us run Chrome in headless mode (without opening a visible browser window) and avoid detection as a bot.

Identifying Target URLs

To build the list of URLs we want to scrape, we can either write a list manually or extract one automatically from Oddschecker’s sitemap. I’ve found that automated retrieval tends to omit a number of odds pages we’d be interested in, so manually listing them is probably the better method.
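For the manual approach, a hard-coded list is all we need. The specific paths below are illustrative placeholders; substitute the market pages you actually care about:

```python
# Manually curated list of Oddschecker market pages to scrape.
# These paths are examples only; replace them with the markets you want.
urls = [
    "https://www.oddschecker.com/politics/british-politics/next-prime-minister",
    "https://www.oddschecker.com/politics/us-politics/us-presidential-election",
    "https://www.oddschecker.com/politics/british-politics/next-general-election",
]

print(f"Scraping {len(urls)} pages.")
```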

For the automated approach:

# URL of the sitemap
sitemap_url = "https://www.oddschecker.com/sport/politics/sitemap.xml"

try:
    # Load the sitemap page
    driver.get(sitemap_url)

    time.sleep(5)

    xml_content = driver.page_source
    soup = BeautifulSoup(xml_content, 'xml')

    # Each <loc> tag in the sitemap holds one URL
    url_tags = soup.find_all('loc')
    urls = [url_tag.text for url_tag in url_tags]

    print(f"Found {len(urls)} URLs.")
    print(urls)

finally:
    # Close the browser
    driver.quit()

Web Scraping with Selenium and BeautifulSoup

Now that we have our target URLs, let’s create a function to extract odds data from a single page:

def extract_odds(url, user_agent=None):
    # Build per-thread Chrome options so each worker can use its own user agent
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    if user_agent:
        options.add_argument(f'user-agent={user_agent}')

    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        time.sleep(6)  # Wait for page to load
        page_source = driver.page_source
        soup = BeautifulSoup(page_source, 'html.parser')
        odds_table = soup.find('tbody', id='t1')

        if not odds_table:
            print(f"No odds table found for URL: {url}")
            return None

        odds_data = []
        for row in odds_table.find_all('tr'):
            bet_name = row.find('a', class_='popup').text.strip()
            odds_dict = {'Bet': bet_name}

            # Odds cells carry the bookmaker code and decimal odds as data attributes
            for td in row.find_all('td', class_=lambda x: x and ('o' in x.split() or 'bs' in x.split())):
                bookmaker = td.get('data-bk')
                decimal_odds = td.get('data-odig')
                if bookmaker and decimal_odds:
                    odds_dict[bookmaker] = float(decimal_odds)

            odds_data.append(odds_dict)

        df = pd.DataFrame(odds_data).set_index('Bet')
        df['URL'] = url
        return df
    finally:
        driver.quit()

This function starts its own WebDriver instance, navigates to the URL, waits for the page to load, extracts the odds table, and structures the data into a pandas DataFrame.

*US Presidential Election odds example*

Handling Multiple Pages

To speed up the scraping process, we can use concurrent.futures to scrape multiple pages in parallel:

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    # ... more user agents ...
]

dataframes_list_oc = []

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = []
    for i, url in enumerate(urls):
        # Rotate through the available user agents
        user_agent = user_agents[i % len(user_agents)]
        futures.append(executor.submit(extract_odds, url, user_agent))

        # Collect results in batches of 5
        if (i + 1) % 5 == 0 or i == len(urls) - 1:
            for future in concurrent.futures.as_completed(futures):
                df = future.result()
                if df is not None:
                    dataframes_list_oc.append(df)
            futures = []

all_data_oc = pd.concat(dataframes_list_oc).reset_index()

This code creates a thread pool with 5 workers, allowing us to scrape up to 5 pages simultaneously. It also rotates through different user agents to reduce the chances of being blocked.
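One detail worth noting: different markets expose different bookmakers, so the per-page DataFrames will have different columns. `pd.concat` aligns them by column name and fills the gaps with NaN. A small illustration of that behaviour, using made-up odds:

```python
import pandas as pd

# Two toy per-page frames with partially overlapping bookmaker columns
page1 = pd.DataFrame(
    {'B365': [1.5, 2.8], 'SK': [1.6, 2.75]},
    index=pd.Index(['Candidate A', 'Candidate B'], name='Bet'),
)
page2 = pd.DataFrame(
    {'B365': [3.0], 'WH': [2.9]},
    index=pd.Index(['Candidate C'], name='Bet'),
)

# Columns become the union (Bet, B365, SK, WH); missing odds are NaN
combined = pd.concat([page1, page2]).reset_index()
print(combined)
```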

Data Cleaning and Preprocessing

After scraping, we may need to clean and preprocess our data. This could involve handling missing values, normalizing bookmaker names, or converting odds formats. The specific steps will depend on the structure of the scraped data and the intended use.
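As one concrete example of format conversion: if you also capture fractional odds as strings, a small helper can normalise them to decimal. This helper is my own illustration, not part of the scraper above:

```python
from fractions import Fraction

def fractional_to_decimal(odds: str) -> float:
    """Convert fractional odds like '5/2' to decimal odds (stake included)."""
    # Decimal odds = fractional profit per unit stake, plus the stake itself
    return float(Fraction(odds)) + 1.0

print(fractional_to_decimal('5/2'))  # 3.5
print(fractional_to_decimal('1/1'))  # 2.0 (evens)
```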

Challenges and Solutions

Web scraping comes with several challenges:

  1. Dynamic content: Oddschecker uses JavaScript to load some content, which is why we use Selenium instead of simpler scraping libraries.
  2. Site structure changes: The HTML structure of the site may change, breaking our scraper. Regular maintenance is necessary.
  3. Rate limiting: To avoid overloading the server, we implement delays between requests and use concurrent scraping judiciously.
  4. Bot detection: We use rotating user agents and mimic human behavior (e.g., adding delays) to avoid being blocked.
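Points 3 and 4 can be combined into a simple retry wrapper with a jittered, growing delay between attempts. The function names here are my own sketch, not part of the scraper above:

```python
import random
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=2.0):
    """Call fetch(url), retrying with a jittered, exponential delay on failure."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # Give up after the final attempt
            # Exponential backoff with jitter so retries don't hammer the server
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```

In the scraper, `fetch` would be `extract_odds`; here a tiny base delay keeps the example fast.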

Conclusion

Scraping betting odds from Oddschecker using Python provides valuable data for analysis and decision-making in the betting world. By using tools like Selenium and BeautifulSoup, and implementing concurrent scraping, we can efficiently collect odds data from multiple markets.

Remember that web scraping should be done responsibly and in compliance with the website’s terms of service. Always ensure that your use of the scraped data is legal and ethical.

This scraping technique can be adapted for other websites and use cases, opening up possibilities for data analysis, market research, and more. Happy scraping!

Written by Riz Dusoye

Random assortment of thoughts on data in sports & finance. www.dusoye.com