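Problem mapping discovered URLs to original URLs in Katana CLI batch processing

I am using the Katana CLI for web crawling, with a Python wrapper that manages batching and output parsing. My goal is to map every discovered URL back to its original URL, but some discovered URLs are not mapped correctly, especially when domains are similar or subdomains are involved.

Here is my setup. Input: powerui.foo.com, acnmll-en.foo.com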

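import logging
import os
import subprocess
import time
from typing import Dict
from urllib.parse import urlparse

class KatanaData: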
    def __init__(self, domain: str, original_url: str, id: str):
        self._original_url = original_url
        self._domain = domain
        self._id = id
        self._discovered_urls = []
        self._error = None
        self._processing_time = None

def run_katana(batch: Dict, timeout=120):
    url_list = create_url_list(batch)

    tmp = 'tmp'
    output_dir = f'{tmp}/SRD_output'
    
    cmd = [
        'katana', '-u', url_list, '-headless', '-headless-options', '--disable-gpu', '-field-scope', 'dn', '-depth', '5',
         '-extension-filter', 'css', '-timeout', '10', '-crawl-duration', f'{timeout}s', '-srd', output_dir
    ]
    error_message = None
    start_time = time.time()
    process = None
    try:
        process = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, timeout=timeout)
        processing_time = round(time.time() - start_time, 2)

        if process.returncode != 0:
            log_with_location(f'Batch for {url_list} failed in {processing_time}s')
            error_message = process.stderr.strip()
    except subprocess.TimeoutExpired:
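        # subprocess.run() has already killed and reaped the child by the time
        # TimeoutExpired is raised, so there is nothing to kill here.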
        log_with_location(f'Batch for {url_list} timed out after {timeout} seconds')
        error_message = "Process timed out"
    finally:
        try:

            parse_srd(output_dir, batch)
        except (FileNotFoundError, ValueError) as e:
            error_message = str(e)
        kill_katana_processes()
        cleanup_temp(tmp)
    return error_message

def parse_srd(output_dir, batch: Dict):
    log_with_location(f'Starting parse_srd for {len(batch)} urls')
    output_file = f'{output_dir}/index.txt'
    if not os.path.exists(output_file):
        log_with_location(f'{output_file} not found in {output_dir}', logging.ERROR)
        raise FileNotFoundError(f'{output_file} not found')

    with open(output_file, 'r') as file:
        lines = file.readlines()
        for line in lines:
            parts = line.split()
            if len(parts) >= 3:
                file_path = parts[0]
                discovered_url = parts[1]
                status = parts[2]
                original_url = find_original_url(discovered_url, batch)
                if original_url:
                    batch[original_url]._discovered_urls.append((discovered_url, status))
                else:
                    print(f"Warning: Original URL not found for discovered URL {discovered_url}")

def find_original_url(discovered_url, batch: Dict):
    discovered_netloc = urlparse(discovered_url).netloc
    for domain, katanaData in batch.items():
        original_netloc = urlparse(katanaData._original_url).netloc
        if discovered_netloc == original_netloc or discovered_netloc.endswith(f".{original_netloc}"):
            return domain
    return None

def create_url_list(batch: Dict):
    return ','.join([data._original_url for data in batch.values()])

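Katana's -srd flag stores the HTTP requests/responses in a custom directory. It also creates an index.txt file with three columns: the location of the stored request/response, the discovered URL, and the status.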

ex:
tmp/SRD_output/powerui.foo.com/87ef37260d0375e204e8e16d00768920fb2bc5eb.txt https://powerui.foo.com/powerui/vendor/bower-asset/masonry/dist/masonry.pkgd.min.js?v20240706122948 (OK)
tmp/SRD_output/newsroom.foo.com/85e443613511a3a41f1d20568d5ecc8506b57e43.txt https://newsroom.foo.com/scripts/scripts.js (OK)
tmp/SRD_output/investor.foo.com/78e3fd823de018fa2323b59ae7ce34b54d8e90d0.txt https://investor.foo.com/javascripts/home-banner.js?revision=d6483dcf-da80-4ab1-a5b7-a507a5dcd426 (OK)

How can I ensure that each discovered URL is correctly mapped back to its original URL, especially when dealing with similar domains? Is there a better strategy for domain matching in this scenario? Is there a way to enhance the find_original_url function to handle these cases more accurately?

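What I have tried:
- Simple domain matching: using an exact domain match, or checking whether the discovered domain ends with the original domain.
- Subdomain handling: I accounted for subdomains, but still run into problems when domains are similar.
- JSONL flag: outputting in jsonl format, where I can sometimes see the source URL. For each jsonl line, the response can be either valid or an error.
- Processing one URL at a time: this is very slow, and I want to take advantage of the concurrency and parallelism the tool offers.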

Warning: Original URL not found for discovered URL https://newsroom.foo.com/scripts/scripts.js
Warning: Original URL not found for discovered URL https://investor.foo.com/javascripts/home-banner.js?revision=d6483dcf-da80-4ab1-a5b7-a507a5dcd426

The issue is that some discovered URLs do not map back to the original URLs correctly, especially when domains are similar. For example, if I have two original URLs with domains of acnml-en.foo and powerui.foo, discovered URLs from subdomains or similar domains are sometimes not mapped correctly.

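The problem you are running into is common: relying on simple string matching to find the original URL can be unreliable, especially with complex site structures and subdomains. Your find_original_url function currently accepts only an exact host match or a subdomain of an original host. A discovered host such as newsroom.foo.com is a sibling of powerui.foo.com rather than a subdomain of it, so it matches nothing. And when one original host is a subdomain of another, whichever entry comes first in the batch wins, which can misclassify the URL.

Here is how you can improve the accuracy: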

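Use Katana's -jsonl flag: As you mentioned, Katana can write its output as JSONL, which carries more information per result, including the page on which each URL was found (its source, or referrer). That is very useful for determining the original URL. Modify your run_katana function to add the -jsonl flag. Note that -jsonl is a switch and takes no value; the file to write to is given with -o (check katana -h for your version). The file name below is only an example:

# Hypothetical file name; make sure the directory exists before Katana runs.
jsonl_file = f'{output_dir}/katana.jsonl'
cmd = [
    'katana', '-u', url_list, '-headless', '-headless-options', '--disable-gpu', '-field-scope', 'dn', '-depth', '5',
    '-extension-filter', 'css', '-timeout', '10', '-crawl-duration', f'{timeout}s', '-srd', output_dir,
    '-jsonl', '-o', jsonl_file
]

Parse the referrer from the JSONL output: Update your parse_srd function to parse the JSONL file and use the referrer information to find the original URL more accurately.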

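```python
import json
import os
from typing import Dict


def parse_srd(output_dir, batch: Dict):
    log_with_location(f'Starting parse_srd for {len(batch)} urls')
    for filename in os.listdir(output_dir):
        if not filename.endswith('.jsonl'):
            continue
        with open(os.path.join(output_dir, filename), 'r') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                data = json.loads(line)
                # The field names below are assumptions; inspect one line of
                # your Katana version's JSONL output and adjust them if needed.
                request = data.get('request', {})
                response = data.get('response', {})
                discovered_url = request.get('endpoint')
                referrer = request.get('source')
                status = response.get('status_code')

                if discovered_url and referrer:
                    original_url = find_original_url(discovered_url, referrer, batch)
                    if original_url:
                        batch[original_url]._discovered_urls.append((discovered_url, status))
                    else:
                        print(f"Warning: Original URL not found for discovered URL {discovered_url}")


def find_original_url(discovered_url, referrer, batch: Dict):
    # The batch is keyed as in the question; compare against the stored
    # original URL and return the matching batch key.
    for key, katana_data in batch.items():
        if referrer.startswith(katana_data._original_url):
            return key
    return None
```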
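One caveat with the prefix check above: at depth 2 or deeper, the referrer of a URL is usually another discovered page rather than one of the original URLs, so `referrer.startswith(...)` can miss it. A minimal sketch of one way around this is to record the source of every URL and walk that chain back until it reaches an original URL. The names `build_parent_map` and `trace_to_seed` are hypothetical, the records are assumed to have already been reduced to (discovered URL, source URL) pairs, and the sample pairs are made up for illustration.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def build_parent_map(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each discovered URL to the page it was first found on."""
    parents: Dict[str, str] = {}
    for discovered_url, source_url in pairs:
        # Keep the first source seen; later duplicates do not overwrite it.
        parents.setdefault(discovered_url, source_url)
    return parents


def trace_to_seed(url: str, parents: Dict[str, str], seeds: Set[str]) -> Optional[str]:
    """Follow source links upward until one of the original URLs is reached."""
    seen = set()
    current = url
    while current not in seen:
        if current in seeds:
            return current
        seen.add(current)
        if current not in parents:
            return None  # the chain ends without reaching an original URL
        current = parents[current]
    return None  # the chain loops back on itself


# Illustrative data only: a script found on a page that was itself
# discovered from one of the original URLs.
pairs = [
    ("https://powerui.foo.com/login", "https://powerui.foo.com"),
    ("https://newsroom.foo.com/scripts/scripts.js", "https://powerui.foo.com/login"),
]
parents = build_parent_map(pairs)
seeds = {"https://powerui.foo.com", "https://acnmll-en.foo.com"}
print(trace_to_seed("https://newsroom.foo.com/scripts/scripts.js", parents, seeds))
# prints https://powerui.foo.com
```

The lookup is by exact string, so it helps to normalize URLs first (for example, trailing slashes and letter case in the host) before building the map and the seed set.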

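Handle edge cases: Although the referrer approach should resolve most cases more accurately, you may still hit some edge cases. Consider adding extra logic for URLs that cannot be resolved through the referrer. For example, you could:
- Relax domain matching: allow partial matches against the original URL's domain, for example by extracting the hostname with urllib.parse.urlparse and checking for similarity.
- Use URL path similarity: if domain matching is ambiguous, compare the path structure of the discovered URL with that of each original URL and pick the closest match.
- Keep a record of discovered URLs: during the Katana run, track which URLs were discovered for each original URL. If a URL is still ambiguous after the other methods, this record can be used to infer its original URL.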
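As a rough sketch of the relaxed domain-matching fallback from the list above, the helper below picks the most specific original host instead of the first one that matches. The names `host_of` and `match_by_hostname` are hypothetical, and `seeds` is assumed to be a plain dict from batch key to original URL.

```python
from typing import Dict, Optional
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Return the lowercase hostname, also for inputs without a scheme."""
    # urlparse only fills in the host when a scheme or a leading "//" is present.
    if "://" not in url:
        url = "//" + url
    return (urlparse(url).hostname or "").lower()


def match_by_hostname(discovered_url: str, seeds: Dict[str, str]) -> Optional[str]:
    """Return the key of the seed whose host is the longest suffix match."""
    host = host_of(discovered_url)
    best_key, best_len = None, -1
    for key, original_url in seeds.items():
        seed_host = host_of(original_url)
        if not seed_host:
            continue
        # Accept the host itself or a true subdomain (note the leading dot).
        if host == seed_host or host.endswith("." + seed_host):
            if len(seed_host) > best_len:
                best_key, best_len = key, len(seed_host)
    return best_key
```

With only powerui.foo.com and acnmll-en.foo.com in the batch, this still returns None for newsroom.foo.com, because a sibling host cannot be attributed by name alone. That is why it is best kept as a fallback behind the referrer information.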

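With these enhancements, your script should map discovered URLs back to their original URLs more accurately and reliably, even on complex sites with many subdomains.

Keep in mind that even with these improvements, mapping every URL back to its exact origin remains difficult, especially with complex site architectures and client-side redirects. Continued testing and refinement of your logic is essential to get the accuracy as high as possible.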

Tags: python, web-crawler, katana
From: 78785687
