Response 403 in Python Requests: how to fix it

Web scraping: how to solve 403 errors

Why am I getting a 403 error status code when web scraping? And what can I do to make it vanish? 3 answers we cover in this article: user agents, HTTP headers, and proxies.

Natasha Lekh


We’re Apify. You can build, deploy, share, and monitor any web scrapers on the Apify platform. Check us out.

Getting a 403 error shouldn’t stop you from extracting publicly available data from any website. But if you’re getting blocked by a 403 error, here are a few things you can do.

What does error 403 mean?

Error 403 (or 403 Forbidden) is a client error HTTP status code returned by the server. Another example is the infamous 404 Not Found error, which shows up when a web page can’t be found.

403 Forbidden means that the server has received your request but refuses to approve it. For a regular user, it is usually a sign they don’t have sufficient rights to access a particular web page. A typical example is a request to view the website as an admin if you’re not one.


However, a 403 error may be a sign of something different when web scraping.

Why am I getting a 403 error when web scraping?

There are several reasons for getting an HTTP 403 error. As with everything in programming, you’ll need to do some detective work to figure out exactly why 403 Forbidden showed up in your particular case. But if you’re trying not just to visit a website but also to extract data from it, the likely reasons narrow down to just two:

  1. You need special rights to access this particular web resource — and you don’t have that authorization.
  2. The website recognized your actions as scraping and is politely banning you.

In a way, the second cause is good news: you’ve made a successful request for scraping. The bad news is that these days, websites are equipped with powerful anti-bot protections able to deduce the intentions of their visitors and block them. If the web page you’re trying to scrape opens normally in a browser but gives you the 403 Forbidden HTTP status code when you request it via a scraper – you’ve been busted!

How to fix 403 errors when scraping

So we’ve identified that the 403 error was no accident and there’s a clear connection with scraping; now what? There are at least three ways to go about this issue. You can try alternating:

  • User agents (different devices)
  • Request headers (other browsers)
  • Rotating proxies (various locations)

But before we start with solutions, we need to know the challenge we’re up against: how was the website able to detect our scraper and apply anti-scraping protections to it? The short answer is digital fingerprints. Skip the longer answer if you already know enough about browser or user fingerprints.

Bot detection: what are user fingerprints?

These days, every respectable website with high stable traffic has a few tracking methods in place to distinguish real users and bots like our scraper. Those tracking methods boil down to the information the website gets about the user’s machine sending an access request to the server, namely the user’s device type, browser, and location. The list usually goes on to include the operating system, browser version, screen resolution, timezone, language settings, extensions like an ad blocker, and many more parameters, small and big.

Every user has a unique combination of that data. Once individual browsing sessions become associated with the visitor, that visitor gets assigned an online fingerprint. Unlike cookies, this fingerprint is almost impossible to shake off — even if you decide to clear browser data, use a VPN, or switch to incognito mode.

Next step: the website has a browser fingerprinting script in place to bounce visitors with unwelcome fingerprints. Some scripts are more elaborate and accurate than others as they factor in more fingerprint signals, but the end goal is the same: filter out flagged visitors. So if your fingerprint gets flagged as belonging to a bot (for example, after repeatedly trying to access the same web page within a short amount of time), it’s the script’s duty to cut off your access to the website and show you some sort of error message – in our case, error 403.

There’s a reason why these techniques are employed in the first place: not all website visitors have good intentions. It is important to keep the spammers, hackers, DDoS attackers, and fraudsters at bay and allow the real users to return as many times as they want. It is also important to let search engine or site monitoring bots do their job without banning them.

Good bots vs. bad bots — what’s the difference for the website server?

But what if we’re just trying to automate a few actions on the website? There’s no harm in extracting publicly available data such as prices or articles. Unfortunately, abuse of web scraping bots has led many websites to put strict measures in place to prevent bots from overloading their servers. And while websites have to be careful about singling out human visitors, there’s no incentive for them to develop elaborate techniques (besides robots.txt) to distinguish good bots from bad ones. It’s easier to ban them all except for search engines. So if we want to do web scraping or web automation, we’re left with no other option but to try and bypass the restrictions. Here’s how you can do it.

4 ways to get rid of 403 errors when web scraping

Change user agents

The most common telltale sign that can make the website throw you a 403 error is your user agent. The user agent is a short bit of information that describes your device to the server when you try to connect and access the website. In other words, the website uses your user agent info as a token to identify who’s sending an HTTP request to access it. It typically includes the application, operating system, vendor, and/or version. Here’s an example of a user agent token; as you can see, it contains the parameters mentioned above.

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36

User agent example

The good news is there are many ways to tinker with and change these parameters manually and try to come off as a different, let’s say, Mozilla+Windows user. But there are rules to it if you want to succeed.

Devices

A small warning before we start: sometimes the user agent term gets closely associated with the type of device – laptop, tablet, mobile, etc. – so you might think that replacing the device type is enough to pass as a different user. It’s not. The device type info is not standalone; it’s usually paired with a certain OS and/or browser. You wouldn’t normally see an Android mobile device reporting iOS and Opera, or an iPad reporting a Windows OS and Microsoft Edge (unless you’re feeling adventurous). These unlikely combinations can unnecessarily attract the website’s attention (because they are just that – unlikely), and that’s what we’re trying to avoid here.

So even if you try to change the device type, it will require you to rewrite the whole user agent to come off as an authentic, real user instead of a bot. That’s why you need to know how to combine all those bits of info. There are some user agent combos that are more common than others and many free resources that share that information. Here’s one of them, for instance — you can find your own user agent there, too, and how common it is for the web these days.

Libraries

There are also HTTP libraries that offer many user agent examples. The problem with the libraries usually is that they can be either outdated or created for website testing specifically. While the issue with the former is clear, the latter needs a bit of explanation. The user agents from a testing library usually indicate directly that they are sampled from a library. This serves a purpose when you’re testing your website (clear labeling is important for identifying bugs) but not when you’re trying to come off as a real user.

In the end, your best chance at web scraping is to come off as different users to the website by randomizing and rotating several proven user agents. Here’s what that could look like:

import request from 'request';

const userAgentsList = [
    'Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
];

let options = {
    url: 'https://google.com/doodles',
    headers: {
        // pick a random user agent from the list for every request
        'User-Agent': userAgentsList[Math.floor(Math.random() * userAgentsList.length)]
    }
};

Or, if you want to step up your game, you can use a library for randomizing user agents, such as modern-random-ua.

All this can help you to tinker with your user agent, come up with a list of the most successful cases and use them in rotation. In many cases, diversifying your HTTP requests with different user agents should beat the 403 Forbidden error. But again, it all highly depends on how defensive the website you’re trying to access is. So if the error persists, here is a level-two modification you can apply.

More complex HTTP headers for browsers

Simple bot identification scripts filter out unwanted website visitors by user agent. But more refined scripts also check users’ HTTP headers – namely, their availability and consistency.

Humans don’t usually visit a website without using some sort of browser as a middleman. These days modern browsers include tons of extra HTTP headers sent with every request to deliver the best user experience (screen size, language, dark mode, etc.)

Since your task is to make it harder for the website to tell whether your HTTP requests are coming from a scraper or a real user, at some point, you will have to change not only the user agent but also other browser headers.

Read more about HTTP headers in our Docs

Complexity

A basic HTTP header section, besides User-Agent, usually includes Referer, Accept, Accept-Language, and Accept-Encoding. A lot of HTTP clients send just such a basic header section with a simple structure like this:
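For illustration, here is roughly what a bare HTTP client sends by default – this particular set comes from the Python Requests library (shown again later on this page), and the exact version string will differ on your machine:

Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: python-requests/2.26.0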

Compare this with an example of a real-life, complex browser header:

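As a rough illustration, here is the kind of header section a real Chrome browser on macOS sends (this sample is reproduced from the request headers shown later on this page; exact values vary by browser and version):

Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"'
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8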

Note that it is not only about the browser name (that one is indicated in the user-agent), but rather the sidekick section that goes with it.

Consistency

Some HTTP headers are shorter, some are longer. But the important thing is that they also have to be consistent with user-agent. There has to be a correct match between the user-agent and the rest of the header section. If the website is expecting a user visiting from an iPhone via Chrome browser (user-agent info) to be accompanied by a long header that includes a bunch of parameters (header info), it’ll be checking your request for that section and that match. It will look suspicious if that section is absent or shorter than it should be, and you can be banned.

For instance, consider browser support for including the Sec-CH-UA-Mobile parameter in the header section.

Not every browser sends this header, but Chromium-based browsers do, which means you’ll have to include it in your request if you want to come off as a real Chrome user. Usually, the more elaborate the browser header is, the better your chances of flying under the radar – if you know how to put it together correctly. The real Chrome header section above does include that parameter, by the way:
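Here are the relevant client-hint lines, reproduced from that example:

sec-ch-ua: '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"'
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"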

You can create your own collection of working header combos or turn to libraries. For instance, this is how you can send a request with predefined headers using the Puppeteer library:

await page.setExtraHTTPHeaders({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,en;q=0.8'
})

await page.goto('...')

By adding an appropriate, expected request header, we’ve reduced the chances of being blocked by the website and getting the dreaded 403. Our request now comes off as from a real user using a real browser, and since most websites are quite careful about keeping human traffic, it gets a pass from the bot-blocking measures.

However, if the 403 error keeps showing up, we have to move on to level 3 issues: IP-rate limiting. Maybe you’ve overdone it with the number of requests to the website, and it flagged your IP address as one belonging to a bot. What now?

Rotating proxies

If the two options above have stopped working for your case, and the 403 error keeps showing up, it’s time to consider the last scenario: your IP address might be getting blocked. This scenario is quite reasonable if you see it from the website’s point of view: it just got a large number of requests coming from the same IP address within an unnaturally short amount of time. Now it could be the case of many people sharing the same IP address by using the same WiFi, but they wouldn’t normally all go on the same website at the same time, right?

The most logical way to approach it is to use a proxy. There are many proxy providers out there, some paid, some free, and varying in type. There are even tools, such as our free Proxy Scraper, that can find free working public proxies for you. But the main point is proxies secure your request with a different IP address every time you access the website you’re trying to scrape. That way, the website will perceive it as coming from different users.

You can use an existing proxy pool or create one of your own, like in the example below.

const { gotScraping } = require('got-scraping');

const proxyUrls = [
    'http://username:password@myproxy1.com:1234',
    'http://username:password@myproxy2.com:1234',
    'http://username:password@myproxy4.com:1234',
    'http://username:password@myproxy5.com:1234',
    'http://username:password@myproxy6.com:1234',
];

// send the same request through each proxy, so every request arrives from a different IP address
proxyUrls.forEach(proxyUrl => {
    gotScraping
        .get({
            url: 'https://apify.com',
            proxyUrl,
        })
        .then(({ body }) => console.log(body));
});

Don’t forget to combine the proxy method with the previous ones you’ve learned: user agents and HTTP headers. That way, you can improve your chances of successful large-scale scraping without breaching any of the website’s bot-blocking rules.

Learn more about using proxies and how to fly under the radar with Apify Academy’s anti-scraping mitigation techniques

Skip the hard part

You’re probably not the first one and not the last one to deal with the website throwing you errors when scraping. Tinkering with your request from various angles can take a lot of time and trial-and-error. Luckily, there are open-source libraries such as Crawlee built specifically to tackle those issues.

This library has a smart proxy pool that rotates IP addresses for you intelligently, picking an unused address from its pool of reliable proxies. In addition, it pairs each proxy with suitable user agents and HTTP headers for your case. There’s no risk of using expired proxies, and no need to manage cookies and auth tokens yourself: Crawlee makes sure they stay tied to the right IP address, which diminishes the chances of getting blocked, including but not limited to the 403 Forbidden error. Conveniently, there’s no need for you to create three separate workarounds for picking proxies, headers, and user agents.

This is how you can set it up:

import { BasicCrawler, ProxyConfiguration } from 'crawlee';
import { gotScraping } from 'got-scraping';

const proxyConfiguration = new ProxyConfiguration({ /* opts */ });

const crawler = new BasicCrawler({
    useSessionPool: true,
    sessionPoolOptions: { maxPoolSize: 100 },
    async requestHandler({ request, session }) {
        const { url } = request;
        const requestOptions = {
            url,
            // a fresh proxy URL is picked for each session
            proxyUrl: await proxyConfiguration.newUrl(session.id),
            throwHttpErrors: false,
            headers: {
                // send the cookies stored in the session
                Cookie: session.getCookieString(url),
            },
        };

        let response;

        try {
            response = await gotScraping(requestOptions);
        } catch (e) {
            if (e === 'SomeNetworkError') {
                // mark the session as bad so it gets used less often
                session.markBad();
            }
            throw e;
        }

        // retire the session if the status code signals a block (e.g. 403)
        session.retireOnBlockedStatusCodes(response.statusCode);

        if (response.body.blocked) {
            session.retire();
        }

        session.setCookiesFromResponse(response);
    },
});

Read more about SessionPool in the Crawlee docs

You also have complete control over how you want to configure your working combination of user agents, devices, OS, browser versions, and all the browser fingerprint details.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // this is the default
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: [{
                    name: 'edge',
                    minVersion: 96,
                }],
                devices: [
                    'desktop',
                ],
                operatingSystems: [
                    'windows',
                ],
            },
        },
    },
    // ...
});

Facing errors is the reality of web scraping. If, after trying all these various ways (rotating user agents, headers, and proxies), and even trying Crawlee, the 403 error still doesn’t leave you alone, you can always turn to the community for guidance and some pretty accurate advice. Share your issue with us on Discord and see what our community of like-minded automation and web scraping enthusiasts has to say.

Another frequent issue is that, while focused on something very complex, you may have missed something very obvious. That’s why you’re invited to refresh your memory or maybe even learn something new about anti-scraping protections in Apify Academy. Best of luck!

Why does my scraper return 403 even after setting Cookie and User-Agent?

I tried to write a scraper to download images from artstation.com. I took a random profile; practically all the content there is loaded via JSON. I found the GET request, and it opens fine in the browser, but via requests.get it returns 403. Everywhere on Google people advise setting the User-Agent and Cookie headers. I used requests.Session and set the User-Agent, but the picture is still the same. What am I doing wrong?

import requests

url = 'https://www.artstation.com/users/kuvshinov_ilya'
json_url = 'https://www.artstation.com/users/kuvshinov_ilya/projects.json?page=1'

header = {}  # the User-Agent value set here was lost in the original post's formatting

session = requests.Session()
r = session.get(url, headers=header)
json_r = session.get(json_url, headers=header)
print(json_r)
# > <Response [403]>
  • Asked more than three years ago
  • 9737 views


Solutions to the question: 2

Retard Soft Inc.

The 403 code is caused by Cloudflare.
cfscrape helped me get around it.

import requests
import cfscrape

def get_session():
    session = requests.Session()
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'ru,en-US;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'
    }
    return cfscrape.create_scraper(sess=session)

session = get_session()
# From here on, use it like a regular requests.Session

A bit of code for pulling direct links to the high-res pictures:

import requests
import cfscrape

def get_session():
    session = requests.Session()
    session.headers = {
        'Host': 'www.artstation.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'ru,en-US;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'
    }
    return cfscrape.create_scraper(sess=session)

def artstation():
    url = 'https://www.artstation.com/kyuyongeom'
    page_url = 'https://www.artstation.com/users/kyuyongeom/projects.json'
    post_pattern = 'https://www.artstation.com/projects/{}.json'
    session = get_session()
    absolute_links = []

    response = session.get(page_url, params={'page': 1}).json()  # the params dict was lost in the page formatting; it passes the page number
    pages, modulo = divmod(response['total_count'], 50)
    if modulo:
        pages += 1

    for page in range(1, pages + 1):
        if page != 1:
            response = session.get(page_url, params={'page': page}).json()
        for post in response['data']:
            shortcode = post['permalink'].split('/')[-1]
            inner_resp = session.get(post_pattern.format(shortcode)).json()
            for img in inner_resp['assets']:
                if img['asset_type'] == 'image':
                    absolute_links.append(img['image_url'])

    with open('links.txt', 'w') as file:
        file.write('\n'.join(absolute_links))

if __name__ == '__main__':
    artstation()

Answered more than three years ago


How To Solve 403 Forbidden Errors When Web Scraping

Getting an HTTP 403 Forbidden Error when web scraping or crawling is one of the most common HTTP errors you will get.

Often there are only two possible causes:

  • The URL you are trying to scrape is forbidden, and you need to be authorised to access it.
  • The website detects that you are a scraper and returns a 403 Forbidden HTTP status code as a ban page.

Most of the time it is the second cause, i.e. the website is blocking your requests because it thinks you are a scraper.

403 Forbidden Errors are common when you are trying to scrape websites protected by Cloudflare, as Cloudflare returns a 403 status code.

In this guide we will walk you through how to debug 403 Forbidden Errors and provide solutions that you can implement.

  • Easy Way To Solve 403 Forbidden Errors When Web Scraping
  • Use Fake User Agents
  • Optimize Request Headers
  • Use Rotating Proxies

Easy Way To Solve 403 Forbidden Errors When Web Scraping​

If the URL you are trying to scrape is normally accessible, but you are getting 403 Forbidden Errors then it is likely that the website is flagging your spider as a scraper and blocking your requests.

To avoid getting detected we need to optimise our spiders to bypass anti-bot countermeasures by:

  • Using Fake User Agents
  • Optimizing Request Headers
  • Using Proxies

We will discuss these below, however, the easiest way to fix this problem is to use a smart proxy solution like the ScrapeOps Proxy Aggregator.

ScrapeOps Proxy Aggregator

With the ScrapeOps Proxy Aggregator you simply need to send your requests to the ScrapeOps proxy endpoint and our Proxy Aggregator will optimise your request with the best user-agent, header and proxy configuration to ensure you don’t get 403 errors from your target website.

Simply get your free API key by signing up for a free account here and edit your scraper as follows:

import requests
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

r = requests.get(get_scrapeops_url('http://quotes.toscrape.com/page/1/'))
print(r.text)

If you are getting blocked by Cloudflare, then you can simply activate ScrapeOps’ Cloudflare Bypass by adding bypass=cloudflare to the request:

import requests
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'bypass': 'cloudflare'}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

r = requests.get(get_scrapeops_url('http://example.com/'))
print(r.text)

Or if you would prefer to try to optimize your user-agent, headers and proxy configuration yourself then read on and we will explain how to do it.

Use Fake User Agents​

The most common reason for a website to block a web scraper and return a 403 error is that you are telling the website you are a scraper in the user agents you send along with your requests.

By default, most HTTP libraries (Python Requests, Scrapy, NodeJs Axios, etc.) either don’t attach real browser headers to your requests or include headers that identify the library that is being used. Both of these immediately tell the website you are trying to scrape that you are a scraper, not a real user.

For example, let’s send a request to http://httpbin.org/headers with the Python Requests library using the default setting:

import requests

r = requests.get('http://httpbin.org/headers')
print(r.text)

You will get a response like this that shows what headers we sent to the website:

    "headers":   "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.26.0", > > 

Here we can see that our request using the Python Requests library appends very few headers to the request, and even identifies itself as the Python Requests library in the User-Agent header.

  "User-Agent": "python-requests/2.26.0", 

This tells the website that your requests are coming from a scraper, so it is very easy for them to block your requests and return a 403 status code.

Solution​

The solution to this problem is to configure your scraper to send a fake user-agent with every request. This way it is harder for the website to tell if your requests are coming from a scraper or a real user.

Here is how you would send a fake user agent when making a request with Python Requests.

import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}

r = requests.get('http://quotes.toscrape.com/page/1/', headers=HEADERS)
print(r.text)

Here we are making our request look like it is coming from an iPad, which will increase the chances of the request getting through.

This will only work on relatively small scrapes, because if you use the same user-agent on every single request, a website with a more sophisticated anti-bot solution could easily still detect your scraper.

To solve this when scraping at scale, we need to maintain a large list of user-agents and pick a different one for each request.

import requests
import random

user_agents_list = [
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
]

# pick a random user agent from the list for this request
r = requests.get('http://quotes.toscrape.com/page/1/', headers={'User-Agent': random.choice(user_agents_list)})
print(r.text)

Now, every time we make a request, we will pick a random user-agent from the list.

Optimize Request Headers​

In a lot of cases, just adding fake user-agents to your requests will solve the 403 Forbidden Error. However, if the website has a more sophisticated anti-bot detection system in place, you will also need to optimize the request headers.

By default, most HTTP clients will only send basic request headers along with your requests such as Accept , Accept-Language , and User-Agent .

Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
Accept-Language: 'en'
User-Agent: 'python-requests/2.26.0'

In contrast, here are the request headers a Chrome browser running on a MacOS machine would send:

Connection: 'keep-alive'
Cache-Control: 'max-age=0'
sec-ch-ua: '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"'
sec-ch-ua-mobile: '?0'
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36'
Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
Sec-Fetch-Site: 'none'
Sec-Fetch-Mode: 'navigate'
Sec-Fetch-User: '?1'
Sec-Fetch-Dest: 'document'
Accept-Encoding: 'gzip, deflate, br'
Accept-Language: 'en-GB,en-US;q=0.9,en;q=0.8'

If the website is really trying to prevent web scrapers from accessing their content, then they will be analysing the request headers to make sure that the other headers match the user-agent you set, and that the request includes other common headers a real browser would send.

Solution​

To solve this, we need to make sure we optimize the request headers, including making sure the fake user-agent is consistent with the other headers.

This is a big topic, so if you would like to learn more about header optimization then check out our guide to header optimization.

However, to summarize, we don’t just want to send a fake user-agent when making a request but the full set of headers web browsers normally send when visiting websites.

Here is a quick example of adding optimized headers to our requests:

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

r = requests.get('http://quotes.toscrape.com/page/1/', headers=HEADERS)
print(r.text)

Here we are adding the same optimized header set with a fake user-agent to every request. However, when scraping at scale you will need a list of these optimized header sets and rotate through them, as sketched below.
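A minimal sketch of that idea, reusing the Firefox header set above plus a Chrome set based on the browser headers shown earlier; both sets are illustrative only, not a vetted list:

import random
import requests

# Each entry is a full, internally consistent header set (user agent + matching headers).
HEADERS_LIST = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    },
]

urls = ['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']

for url in urls:
    # rotate through the header sets by picking a random one per request
    r = requests.get(url, headers=random.choice(HEADERS_LIST))
    print(url, r.status_code)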

Use Rotating Proxies​

If the above solutions don’t work then it is highly likely that the server has flagged your IP address as being used by a scraper and is either throttling your requests or completely blocking them.

This is especially likely if you are scraping at larger volumes, as it is easy for websites to detect scrapers if they are getting an unnaturally large amount of requests from the same IP address.

Solution​

You will need to send your requests through a rotating proxy pool.

Here is how you could do it with Python Requests:

import requests
from itertools import cycle

list_proxy = [
    'http://Username:Password@IP1:20000',
    'http://Username:Password@IP2:20000',
    'http://Username:Password@IP3:20000',
    'http://Username:Password@IP4:20000',
]

proxy_cycle = cycle(list_proxy)
proxy = next(proxy_cycle)

for i in range(1, 10):
    proxy = next(proxy_cycle)
    print(proxy)
    proxies = {
        "http": proxy,
        "https": proxy,
    }
    r = requests.get(url='http://quotes.toscrape.com/page/1/', proxies=proxies)
    print(r.text)

Now, your request will be routed through a different proxy with each request.

You will also need to incorporate the rotating user-agents we showed previously; otherwise, even when using a proxy, we will still be telling the website that our requests are from a scraper, not a real user. A combined sketch follows below.
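Here is a rough sketch of combining the two techniques; the proxy URLs and target site are the placeholders used earlier in this guide and would need to be replaced with real values:

import random
import requests
from itertools import cycle

user_agents_list = [
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
]

proxy_cycle = cycle([
    'http://Username:Password@IP1:20000',
    'http://Username:Password@IP2:20000',
])

for page in range(1, 4):
    proxy = next(proxy_cycle)  # different proxy for each request
    r = requests.get(
        f'http://quotes.toscrape.com/page/{page}/',
        headers={'User-Agent': random.choice(user_agents_list)},  # different user agent too
        proxies={'http': proxy, 'https': proxy},
    )
    print(page, r.status_code)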

If you need help finding the best & cheapest proxies for your particular use case then check out our proxy comparison tool here.

Alternatively, you could just use the ScrapeOps Proxy Aggregator as we discussed previously.

More Web Scraping Tutorials​

So that’s how you can solve 403 Forbidden Errors when you get them.

If you would like to know more about bypassing the most common anti-bots then check out our bypass guides here:

  • How To Bypass Cloudflare
  • How To Bypass PerimeterX

Or if you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.

Or check out one of our more in-depth guides:

  • How to Scrape The Web Without Getting Blocked Guide
  • The State of Web Scraping 2020
  • The Ethics of Web Scraping


    psf / requests Public


    requests.get with stream=True never returns on 403 #6376

    mattpr opened this issue Mar 9, 2023 · 6 comments


    mattpr commented Mar 9, 2023

I have a case where a URL I was using requests to fetch (an image, so using stream=True to download to a local file) started returning 403 errors with some HTML, and some very old code stopped working. The 403 isn’t the problem. The issue is a requests call that hangs. We didn’t notice for a while because it didn’t crash/error, it just hung. Apparently for a couple weeks (need more monitoring coverage for other signals there).

    Running it manually it also hangs for at least a few minutes (as long as I waited). This shouldn’t be a timeout case anyway as the server responds right away with the 403.

If I manipulate headers (adding headers=... to the requests.get() call), I can make the 403 go away and the code runs fine again. But that isn’t a solution. The issue for me here is the hang, because I can’t handle or report that there is an issue (e.g. getting a 403).

    Looking through the docs, I don’t see anything about this behaviour but I might have missed it. Any idea what I missed?

# python3 --version
Python 3.8.10
# pip3 list | grep requests
requests           2.28.2

    curl of offending url (redacted)

# curl -s -D - https://example.com/some/path/file.jpg
HTTP/2 403
mime-version: 1.0
content-type: text/html
content-length: 310
expires: Thu, 09 Mar 2023 12:22:13 GMT
cache-control: max-age=0, no-cache
pragma: no-cache
date: Thu, 09 Mar 2023 12:22:13 GMT
server-timing: cdn-cache; desc=HIT
server-timing: edge; dur=1
server-timing: ak_p; desc="466212_388605830_176004165_24_6476_12_0";dur=1

    Access Denied

    You don't have permission to access "XXXXXXXXXX" on this server. XXXXX

    Excerpt of the hanging code. res = requests.get(url, stream=True) never returns.

    local_file = "/tmp/file.jpg" url = "https://example.com/some/path/file.jpg" res = requests.get(url, stream=True) # print("I never print.") if res.status_code == 200: try: with open(local_file, 'wb') as fd: for chunk in res.iter_content(chunk_size=128): fd.write(chunk) except EnvironmentError as e: print("Received error when attempting to download .".format(url)) print(e) return False return True else: print("Received status when attempting to download .".format(res.status_code, url)) return False


    mattpr commented Mar 9, 2023

    Good point. I let it hang and then aborted. Here is the relevant stack. appears to be related to ssl?

    When I add some http headers to the request the 403 goes away and the request.get works.

    So unless the server is doing some conditional ssl/tls stuff based on HTTP (doesn’t make any sense to me as http happens after tls), I’m not sure what is up there.

    In any case, I would expect an ssl problem (timeout, hangup, whatever) to be surfaced. but maybe the problem is in python rather than requests.

^CTraceback (most recent call last):
  ...
  File "/opt/script.py", line 96, in downloadImageToFile
    res = requests.get(url, stream=True)
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt

    Contributor
    sigmavirus24 commented Mar 9, 2023

    So we’re hanging trying to read the very first line which would be the HTTP version Status code Status reason information.

    What headers are you adding?

Many servers have started blocking the requests user-agent string because of abusive and malicious actors using requests. It’s possible this server thinks you’re acting maliciously and is doing something similar.

    This could be solved with a default timeout which we have an open issue for, but you can also set a read timeout to do the same thing yourself.

    I don’t believe this is a bug we can fix (see also discussion on the timeout issue) in the near term.

    mattpr commented Mar 9, 2023

headers = {
    'accept': 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://example.com/path/to/file/index.html',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
}

    Many servers have started blocking requests user agent strings because of abusive and malicious actors using requests. It’s possible this server thinks you’re acting maliciously and doing something similar.

    Understandable. I expect that is why I get the 403 (visible in curl. goes away once I add headers. including user-agent). user-agent and referer are the only headers they might be switching on (whether I get a 403 or not).

    But SSL/TLS breaking is weird as the only http-related stuff available at TLS handshake is the server-name-indication which is just the requested hostname. no http headers. So to break ssl it would have to allow ssl until the http request was received/parsed and then maybe from the http-layer just orphan off the request without responding. but I would expect their webserver to have some upstream timeout or something at some point. I can’t imagine their server left the tcp connection open for weeks with no traffic. So at some point the tcp connection should have timed out or got a FIN from their end at which point the connection has failed and some kind of error should surface somewhere on the requesting client’s side.

    As it stands now, it looks like we were sitting there for 2 weeks waiting for a response without the tcp connection closing or timing out. I can’t imagine that is what is actually happening, but maybe it is and a timeout is the remedy.

    Plus this is another level beyond the 403 which isn’t super logical to me.

    • 200 — okay (with user-agent and referer)
    • 403 — no auth (curl without user-agent and referer)
    • tls/ssl hang — (requests without user-agent and referer)

    I did my tcpdump troubleshooting already for the month but maybe if I get fired up I’ll do some more on this topic in the interest of getting the underlying issue identified (e.g. is a FIN being ignored or is the tcp connection really staying open this long?).

    This could be solved with a default timeout which we have an open issue for, but you can also set a read timeout to do the same thing yourself.

    I will try to work around this by specifying a timeout so at least we don’t hang without failing for weeks on end. but I suspect there is something that should be fixed here (although it might be in urllib3 or python’s http, socket or ssl).
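For reference, here is a minimal sketch of that workaround: passing a (connect, read) timeout to requests.get so a silent hang surfaces as an exception instead of blocking forever. The URL and timeout values are placeholders.

import requests

url = "https://example.com/some/path/file.jpg"

try:
    # connect timeout of 5s, read timeout of 30s; a hung read now raises instead of blocking indefinitely
    res = requests.get(url, stream=True, timeout=(5, 30))
    print(res.status_code)
except requests.exceptions.Timeout as e:
    print("Request timed out:", e)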

    Just for kicks I thought I’d try again with curl using requests’ user-agent to see if I could get the server to not respond. Still get the 403.

# curl -s -D - -H 'User-Agent: python-requests/2.28.2' https://example.com/some/path/file.jpg
HTTP/2 403
mime-version: 1.0
content-type: text/html
content-length: 310
expires: Thu, 09 Mar 2023 16:29:48 GMT
cache-control: max-age=0, no-cache
pragma: no-cache
date: Thu, 09 Mar 2023 16:29:48 GMT
server-timing: cdn-cache; desc=HIT
server-timing: edge; dur=1
server-timing: ak_p; desc="466216_388605857_42742355_25_6766_11_0";dur=1

    Access Denied

    You don't have permission to access "XXXXXXX" on this server. XXXXXXXXXX
