List company names from a web page using PowerShell or Python

Tags: python powershell selenium-webdriver web-scraping

I want to use PowerShell or Python to list only the company names from this URL:

https://www.moneycontrol.com/markets/earnings/results-calendar/?activeDate=2024-07-29

Below is my Python script for inspecting the structure of the page:

import requests
from bs4 import BeautifulSoup

# URL of the page
url = "https://www.moneycontrol.com/markets/earnings/results-calendar/?activeDate=2024-07-29"

# Fetch the page content
response = requests.get(url)
print(f"Status Code: {response.status_code}")

soup = BeautifulSoup(response.content, 'html.parser')

# Print the first 1000 characters of the HTML to check if we're getting content
print(soup.prettify()[:1000])

# Try different selectors
selectors = [
    'div.PA10',  # Example selector, replace with potential correct ones
    'table.mctable1',
    'td.PR10.PT5.PB5',
    # Add more potential selectors here
]

for selector in selectors:
    elements = soup.select(selector)
    print(f"\nTrying selector: {selector}")
    print(f"Found {len(elements)} elements")
    for element in elements[:5]:  # Print first 5 elements for each selector
        print(element.text.strip())

# If still no results, print all unique tag names in the HTML
print("\nAll unique tags in the HTML:")
print(set([tag.name for tag in soup.find_all()]))

Output:

PS C:\AMD> python getcompany.py
Status Code: 200
<!DOCTYPE html>
<html lang="en">
 <head>
  <link as="style" href="https://accounts.moneycontrol.com/assets/css/mclogin/bootstrap.min.css" rel="preload"/>
  <link as="style" href="https://stat2.moneycontrol.com/mccss/headfoot/mc_header.css?v=1.11" rel="preload"/>
  <meta charset="utf-8"/>
  <title>
   Results Calendar: Company Results Calendar, Quarterly Results Calendar and BSE NSE Results Calendar | Moneycontrol
  </title>
  <meta content="Results Calendar: Check quarterly results calendar of BSE NSE listed companies by the Moneycontrol. Get results announcements date of all the listed stocks and shares, stocks earnings calendar, Stocks results calendar, earnings date calendar, earnings result date, announcements, news, and more" name="description"/>
  <meta content="Results Calendar, Earnings calendar, listed company results calendar, Quarterly results calendar, Company results dates list, Company Quarterly Earnings Calendar, Company Quarterly Results Calendar, Company Quarterly Earn

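I could not work out which selector corresponds to the Company Name column. I then tried the following Python script, but it also fails to return the list of company names: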

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
import random

def random_sleep(min_seconds, max_seconds):
    time.sleep(random.uniform(min_seconds, max_seconds))

# Setup Chrome options
chrome_options = Options()
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

# Setup the Chrome driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

# Overwrite the navigator.webdriver property
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")

# URL of the page
url = "https://www.moneycontrol.com/markets/earnings/results-calendar/?activeDate=2024-08-01"

try:
    # Load the page
    driver.get(url)
    print("Page loaded successfully")

    # Random sleep to mimic human behavior
    random_sleep(5, 10)

    # Wait for the table to be present in the DOM
    wait = WebDriverWait(driver, 30)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "mctable1")))

    # Get the page source
    page_source = driver.page_source

    # Parse the page source with BeautifulSoup
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find the table
    table = soup.find('table', class_='mctable1')

    if table:
        # Find all rows in the table
        rows = table.find_all('tr')

        # Extract and print company names
        for row in rows[1:]:  # Skip header row
            cells = row.find_all('td')
            if len(cells) > 1:
                company_name = cells[0].text.strip()
                print(company_name)
    else:
        print("Table not found in the page source")

except Exception as e:
    print(f"An error occurred: {str(e)}")

finally:
    # Close the browser
    driver.quit()

Can you suggest a solution in PowerShell, Python, or any other language?

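The problem appears to be that the content you are after is loaded dynamically. That is, the site's initial HTML does not contain the list of company names; the list is populated by JavaScript after the page loads.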
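
One way to check this diagnosis before reaching for a browser is to count how often the target table appears in the raw HTML compared with the rendered DOM. The following is a minimal sketch using only the standard library. The class name mctable1 comes from the question and has not been verified against the live page, and the two HTML strings are hypothetical stand-ins for a server response and a rendered page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or valueless, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_tags_with_class(html, tag, css_class):
    """Return how many <tag> elements in html carry css_class."""
    finder = ClassFinder(tag, css_class)
    finder.feed(html)
    return finder.count


# Hypothetical examples: what the server sends vs. what the browser builds.
static_html = (
    "<html><body><div id='root'></div>"
    "<script src='app.js'></script></body></html>"
)
rendered_html = (
    "<html><body><div id='root'>"
    "<table class='mctable1 wide'><tr><td>X</td></tr></table>"
    "</div></body></html>"
)

print(count_tags_with_class(static_html, "table", "mctable1"))
print(count_tags_with_class(rendered_html, "table", "mctable1"))
```

If the count is zero for response.text from requests but non-zero for driver.page_source from Selenium, the table is being built client-side. If it is zero in both, the selector itself is probably wrong.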

Here is one way to address this in Python with the requests_html library, which can render JavaScript:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

url = "https://www.moneycontrol.com/markets/earnings/results-calendar/?activeDate=2024-08-01"

session = HTMLSession()
response = session.get(url)

# This is the key step: render the page so that its JavaScript runs
response.html.render()

soup = BeautifulSoup(response.html.html, 'html.parser')

table = soup.find('table', class_='mctable1')

if table:
    rows = table.find_all('tr')
    for row in rows[1:]:
        cells = row.find_all('td')
        if len(cells) > 1:
            company_name = cells[0].text.strip()
            print(company_name)
else:
    print("Table not found in the page source")

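This code does the following:

Fetch the page with requests_html: this works like requests, but requests_html can also render JavaScript.
Render the page: the response.html.render() call executes the page's JavaScript and updates the page content.
Parse the HTML: BeautifulSoup parses the resulting HTML, just as before.
Extract the company names: the logic is unchanged; it reads the first cell of each table row.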
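
The extraction step can also be tested offline against a saved snippet, which separates "the page did not render" from "the row logic is wrong". The sketch below is a standard-library variant of the first-cell logic above, not the code from the answer. The sample table is invented for illustration, and the parser assumes the table is not nested inside another table.

```python
from html.parser import HTMLParser


class FirstCellExtractor(HTMLParser):
    """Collects the text of the first <td> in every <tr>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._cell_index = -1        # index of the current <td> in its row
        self._in_first_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = -1    # a new row starts
        elif tag == "td":
            self._cell_index += 1
            self._in_first_cell = self._cell_index == 0
            self._buffer = []

    def handle_data(self, data):
        if self._in_first_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_first_cell:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.names.append(text)
            self._in_first_cell = False


def first_column(html):
    """Return the text of the first data cell of each table row."""
    parser = FirstCellExtractor()
    parser.feed(html)
    return parser.names


# Invented sample; the real page's markup may differ.
sample = """
<table class="mctable1">
  <tr><th>Company Name</th><th>Result Date</th></tr>
  <tr><td> <a href="#">Example Ltd</a> </td><td>01 Aug</td></tr>
  <tr><td>Sample &amp; Co</td><td>01 Aug</td></tr>
</table>
"""

print(first_column(sample))
```

Header rows built from th cells are skipped automatically because only td cells are counted, so no rows[1:] slice is needed here.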

This approach should resolve the issue and let you retrieve the company names from the dynamically loaded table.

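If you run into other problems with requests_html or any other library, make sure it is installed (pip install requests_html) and that your environment is set up correctly. Note that the first call to render() downloads a Chromium build, so the first run can take noticeably longer.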
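
A quick way to confirm that the libraries are importable in the interpreter you are actually running is shown below. This is a small sketch; is_installed is a hypothetical helper name, not part of any library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Module names used by the scripts in this thread.
for name in ("requests_html", "bs4", "selenium"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Running this with the same python command you use for the scraper helps catch the case where pip installed the package into a different environment.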

From: 78805804
