How do you scrape product links from a Tmall shop with Python?

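When scraping pages with a Python crawler, you will often run into responses that contain special characters. If you copy the link into a browser and open it, you will see that every node carries an extra \, so response.xpath() cannot locate elements directly. To avoid that, filter the response content first, then use response.replace() to assign the filtered HTML document back to response. This article walks through the process, using the product links of a Tmall shop as the example.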

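Approach

1. Use response.text to get the HTML text and remove the \ characters from it;

2. Use response.replace() to assign the HTML, with the \ removed, back to response;

3. Use response.xpath() to locate the elements and extract the product links.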
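
Steps 1 and 2 can be tried outside Scrapy first. The following is a minimal sketch using only the standard library. The SAMPLE string and the helper name unwrap_jsonp_html are hypothetical, made up for illustration; the real response is much longer. Unlike the spider below, the pattern here is greedy and uses re.S, on the assumption that the wrapper's closing ") is the last one in the body.

```python
import re

# Hypothetical, shortened JSONP body; the real Tmall payload is much longer.
SAMPLE = (
    r'jsonp314("<div class=\"item5line1\"><dl><dd class=\"detail\">'
    r'<a href=\"//detail.tmall.com/item.htm?id=1\">A</a></dd></dl></div>")'
)


def unwrap_jsonp_html(text):
    """Return the HTML inside jsonpNNN("...") with backslashes removed, or None."""
    # Greedy match with re.S: assumes the wrapper's closing '")' is the last one.
    match = re.search(r'jsonp\d+\("(.*)"\)', text, re.S)
    if match is None:
        return None
    # Step 1 of the approach: drop every backslash from the payload.
    return match.group(1).replace('\\', '')


html = unwrap_jsonp_html(SAMPLE)
print(html)
```

The cleaned string is what gets passed to response.replace(body=...) in the spider, after which XPath expressions see ordinary attribute quotes.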

Code

# -*- coding: utf-8 -*-
import re
import scrapy


class TmallSpider(scrapy.Spider):
    name = 'tmall'
    allowed_domains = ['tmall.com']
    start_urls = [
        'https://wogeliou.tmall.com/i/asynSearch.htm?_ksTS=1611910763284_313&callback='
        'jsonp314&mid=w-22633333039-0&wid=22633333039&path=/search.htm&search=y&spm=a220o.1000855.0.0.7fcc367fsdZyLF'
    ]

    custom_settings = {
        'ITEM_PIPELINES': {
            'tn_scrapy.pipelines.TnScrapyPipeline': 300,
        },
        'DEFAULT_REQUEST_HEADERS': {
            "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/78.0.3904.70 Safari/537.36',
            # Placeholder: paste the cookie from a logged-in browser session here.
            'cookie': 'cookie copied after logging in'
        }
    }

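    def parse(self, response):
        # The endpoint returns JSONP such as jsonp314("<escaped html>"); capture the quoted payload.
        matches = re.findall(r'jsonp\d+\("(.*?)"\)', response.text)
        if not matches:
            self.logger.warning('No JSONP payload found in %s', response.url)
            return
        # Strip the backslashes that escape the HTML inside the JSONP string.
        html = re.sub(r'\\', '', matches[0])
        # Rebuild the response from the cleaned HTML so that XPath can match elements.
        response = response.replace(body=html)
        link = response.xpath('//div[@class="item5line1"]/dl/dd[@class="detail"]/a/@href').extract()
        print('link: ', link)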
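
Removing every backslash is blunt: an escape such as \r\n in the payload would be left behind as the stray letters rn. As an alternative to the method above (not the article's own), the sketch below decodes the quoted payload with json.loads, assuming the payload is a valid JSON string literal. That assumption has not been checked against the live Tmall response, so the helper returns None when decoding fails. SAMPLE and decode_jsonp_string are hypothetical names used for illustration only.

```python
import json
import re

# Hypothetical sample with escaped quotes, escaped slashes and an escaped CRLF.
SAMPLE = r'jsonp314("<a href=\"\/\/detail.tmall.com\/item.htm?id=1\">A\r\n<\/a>")'


def decode_jsonp_string(text):
    """Decode the string argument of jsonpNNN("...") as a JSON string, or return None."""
    # Capture the payload including its surrounding double quotes.
    match = re.search(r'jsonp\d+\((".*")\)', text, re.S)
    if match is None:
        return None
    try:
        # json.loads turns \" into ", \/ into / and \r\n into a real line break.
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # Assumption violated: the payload is not a valid JSON string literal.
        return None


decoded = decode_jsonp_string(SAMPLE)
print(repr(decoded))
```

If decoding succeeds, the result can be passed to response.replace(body=...) in the same way as the backslash-stripped HTML.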

That is how to scrape Tmall shop product links with Python; you can drop the code into your own project and use it directly.


Posted in: Python crawlers

2021-07-18
