Finding and clicking a button with dynamic CSS using Playwright (scraping TikTok)


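I'm developing an OSS scraping tool for various social media platforms, and I have a small problem with TikTok. I can successfully scrape a profile and get its metadata. However, I'd also like to pull the metadata of the videos associated with that profile. The video information is contained in an XHR call.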

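However, once the page loads, a login modal appears. I found that if I click the "Continue as guest" button, the modal goes away and the XHR requests are executed. What makes this difficult is that TikTok uses generated CSS class names for the button.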

I got this to work:

page.click('.css-dcgpa6-DivBoxContainer');
However, the identifier changes every few minutes.

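So my question is, is there a way to:

  1. Find the button's CSS class by using the fact that the button contains known text? And then:
  2. Use Playwright to click this button?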

Here's my code:

    def collect(self, username: str) -> dict:
        _xhr_calls = []
        final_url = f"{TIKTOK_BASE_URL}{username}"

        def intercept_response(response):
            """Capture all background requests and save them."""
            # We can extract details from background requests
            if response.request.resource_type == "xhr":
                logging.debug(f"Appending {response.request.url}")
                _xhr_calls.append(response)
            return response

        with sync_playwright() as pw_firefox:
            browser = pw_firefox.firefox.launch(headless=True, timeout=self.timeout)
            context = browser.new_context(viewport={"width": 1920, "height": 1080},
                                          strict_selectors=False)
            page = context.new_page()

            # Block cruft
            page.route("**/*", AsyncUtils.intercept_route)

            # Enable background request intercepting:
            page.on("response", intercept_response)

            # Navigate to the profile page
            page.goto(final_url, referer=final_url)
            page.wait_for_timeout(1500)
            # Get the page content
            html = page.content()

            # Parse it.
            soup = BeautifulSoup(html, 'html.parser')

            # The user info is contained in a large JS object called __UNIVERSAL_DATA_FOR_REHYDRATION__.
            tt_script = soup.find('script', attrs={'id': "__UNIVERSAL_DATA_FOR_REHYDRATION__"})

            try:
                raw_json = json.loads(tt_script.string)
            except AttributeError as exc:
                raise JSONDecodeError(
                    f"ScrapeOMatic was unable to parse the data from TikTok user {username}. Please try again.\n {exc}") from exc

            user_data = raw_json['__DEFAULT_SCOPE__']['webapp.user-detail']['userInfo']['user']
            stats_data = raw_json['__DEFAULT_SCOPE__']['webapp.user-detail']['userInfo']['stats']

            """
            button = page.get_by_text('p:has-text("Continue as guest")')
            guest_button = page.locator(selector="div", has=button)
            if guest_button is not None:
                logging.debug("Clicking button.")
                guest_button.click(no_wait_after=True)

            # page.click('.css-dcgpa6-DivBoxContainer');
            # page.click('.emuynwa3');
            # page.wait_for_timeout(500)
            # page.keyboard.press("PageDown")
            # page.wait_for_timeout(500)
            # page.keyboard.press("PageDown")
            """

            data_calls = [f for f in _xhr_calls if "list" in f.url]
            for call in data_calls:
                logging.debug(call.json())

            profile_data = {
                'sec_id': user_data['secUid'],
                'id': user_data['id'],
                'is_secret': user_data['secret'],
                'username': user_data['uniqueId'],
                'bio': emoji.demojize(user_data['signature'], delimiters=("", "")),
                'avatar_image': user_data['avatarMedium'],
                'following': stats_data['followingCount'],
                'followers': stats_data['followerCount'],
                'language': user_data['language'],
                'nickname': emoji.demojize(user_data['nickname'], delimiters=("", "")),
                'hearts': stats_data['heart'],
                'region': user_data['region'],
                'verified': user_data['verified'],
                'heart_count': stats_data['heartCount'],
                'video_count': stats_data['videoCount'],
                'is_verified': user_data['verified'],
                # 'videos': videos,
                # 'hashtags': self.hashtags
            }

            return profile_data

Any help would be greatly appreciated. Here's also a link to the GitHub repository: https://github.com/geniza-ai/scrapeomatic

Thanks!!

python playwright playback playwright-python tiktok
1 Answer

Although I'm not a Python developer (and I'm not too proud to dust off my old Python knowledge), I can offer a pseudocode-style way to achieve what you need.

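Your question says:

  1. Find the button's CSS class by using the fact that the button contains known text? And then:
  2. Use Playwright to click this button?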

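If you already know the button text, you don't need its class. You have the text, so you can locate the button by that identifier:

`expect(page.get_by_text("Continue as guest")).to_be_visible()`

Then just click it. Since your code uses the synchronous API, no `await` is needed.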
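To make that a bit more concrete, here is a minimal sketch of the locate-by-text-then-click idea. The helper name `dismiss_login_modal`, its default label, and the 5-second timeout are my own assumptions rather than anything from your repository. It only relies on `page.get_by_text`, `Locator.first`, `Locator.wait_for`, and `Locator.click`, so it should work with the `page` object you already create in `collect`.

```python
def dismiss_login_modal(page, label="Continue as guest", timeout_ms=5000):
    """Click the element whose visible text matches `label`.

    Hypothetical helper: `page` is expected to be a Playwright sync-API Page.
    Returns True if the button was found and clicked, False otherwise.
    """
    # get_by_text matches on rendered text, so the generated CSS class
    # (e.g. css-dcgpa6-DivBoxContainer) never has to be known.
    button = page.get_by_text(label).first
    try:
        # Wait until the modal button is actually visible before clicking.
        button.wait_for(state="visible", timeout=timeout_ms)
    except Exception:
        # Playwright raises its TimeoutError here if the modal never shows up;
        # a broad catch is used so this sketch does not need extra imports.
        return False
    button.click()
    return True
```

In your `collect` method you would call `dismiss_login_modal(page)` after `page.goto(...)` and before reading `_xhr_calls`. Returning `False` instead of raising lets the scrape continue when TikTok happens not to show the modal.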
