Reference link: click here
The Jupyter notebook used for this assignment is a2_datarep.ipynb:
{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "dcMf4aubeMI9" }, "source": [ "*****************************************************************\n", "# The Social Web: data representation\n", "- Instructors: Jacco van Ossenbruggen.\n", "- TAs: Ayesha Noorain, Alex Boyko, Caio Silva, Elena Beretta, Mirthe Dankloff.\n", "- Exercises for Hands-on session 2\n", "*****************************************************************" ] }, { "cell_type": "markdown", "metadata": { "id": "Zhts5HMzeMI-" }, "source": [ "In this session you are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data.\n", "\n", "Prerequisites:\n", "- Python 3.8\n", "- Python packages: requests, BeautifulSoup4, HTMLParser, rdflib\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "6f-OtFPPeMJA", "outputId": "9bcb836f-4204-4fac-d133-99e81a0b2884" }, "outputs": [], "source": [ "# If you're using a virtualenv, make sure it's activated before running\n", "# this cell!\n", "!pip install requests\n", "!pip install BeautifulSoup4\n", "!pip install HTMLParser\n", "!pip install rdflib" ] }, { "cell_type": "markdown", "metadata": { "id": "irPnmIK4eMJd" }, "source": [ "## Exercise 1\n", "\n", "Even if web pages do not use microformat, interesting data can often be extracted from the HTML. You may use packages such as BeautifulSoup to extract arbitrary pieces of data from any HTML page.\n", "The example below shows how we can find the URL of first image in the infobox table of the wikipedia page on Amsterdam. Tip: compare the code below with HTML source code of the wikipedia page: the image url is in the \"src\" attribute of the \"img\" element of in the \"table\" element with class=\"infobox\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "9gpHw90keMJf", "outputId": "7ae1fe64-8d85-4a47-cfdf-422284954d81" }, "outputs": [], "source": [ "# -*- coding: utf-8 -*-\n", "\n", "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "# This script requires you to add a url of a page with geotags to the commandline, e.g.\n", "# python geo.py 'http://en.wikipedia.org/wiki/Amsterdam'\n", "URL = 'https://en.wikipedia.org/wiki/Amsterdam'\n", "\n", "req = requests.get(URL, headers={'User-Agent' : \"Social Web Course Student\"})\n", "soup = BeautifulSoup(req.text)\n", "# print(req.text)\n", "image1 = soup.findAll('table', class_='infobox')[0].find('img')\n", "print(image1['src']) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extracting coordinates from a webpage and reformatting them in the geo microformat (based on Example 8-1 in Mining the Social Web). Note that wikipages may encode long/lat information in different ways. On of the ways used by the Amsterdam wikipedia page is in a span element that is not shown to the user: \n", "<span class=\"geo\">52.367; 4.900</span>\n", "This span element has a single child: len(geoTag == 1) and no further structure, we have to manually get the long/lat by splitting the string on the ';' semicolon." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "LtHtQT9PeMJl", "outputId": "8a7f7b52-cdb2-409f-b3f0-ee7adf60a9f7" }, "outputs": [], "source": [ "\n", "geoTag = soup.find(True, 'geo')\n", "print(geoTag)\n", "\n", "if geoTag and len(geoTag) > 1:\n", " lat = geoTag.find(True, 'latitude').string\n", " lon = geoTag.find(True, 'longitude').string\n", " print ('Location is at'), lat, lon\n", "elif geoTag and len(geoTag) == 1:\n", " (lat, lon) = geoTag.string.split(';')\n", " (lat, lon) = (lat.strip(), lon.strip())\n", " print (('Location is at'), lat, lon)\n", "else:\n", " print ('Location not found')\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8S_bXnjveMJp" }, "source": [ "### Task 1\n", "\n", "Can you convert the output of Exercise 1 into KML? Here is the KML documentation: https://developers.google.com/kml/documentation/?csw=1 and here you can find a simple example of how it is used: https://renenyffenegger.ch/notes/tools/Google-Earth/kml/index\n", "\n", "Visualise the point in Google Maps using the following code example: https://developers.google.com/maps/documentation/javascript/examples/layer-kml-features\n", "You will have to create your own KML file for the custom map layer, and provide a URL to the KML file inside the JavaScript code, which means that you have to upload the file somewhere. You can use a service like http://pastebin.com/ to obtain a URL for your KML file —> paste the code there and request the RAW format URL; use this one in this Task1.\n", "\n", "Is KML a microformat, why (not)?" ] }, { "cell_type": "markdown", "metadata": { "id": "kUnka7EyeMJp" }, "source": [ "## Exercise 2 \n", "In order to find information in the web we can use microformats such as [hRecipe](https://microformats.org/wiki/hrecipe) or Schema.org's [Recipe](https://schema.org/Recipe). But first, we'll show you how to find arbitrary tags in a webpage.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "b0pBs-PVeMJq" }, "source": [ "### Task 2 \n", "Parsing data for a <sub><sup>veggie</sup></sub> spaghetti alla carbonara recipe (from Example 2-7 in Mining the Social Web)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mt9BK_CZeMJr" }, "outputs": [], "source": [ "import requests\n", "import json\n", "from bs4 import BeautifulSoup\n", "\n", "# A yummy webpage (feel free to change to your likings.)\n", "URL = \"https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara/\"\n", "\n", "# requests will return the html found at the given webpage...\n", "page = requests.get(URL)\n", "# ...and a BeautifulSoup object can be created from its content.\n", "soup = BeautifulSoup(page.content, 'html.parser')\n", "\n", "listchildren = list(soup.children)\n", "# print(listchildren)" ] }, { "cell_type": "markdown", "metadata": { "id": "IhdMwqykeMJt" }, "source": [ "We can find any element in the page through *css tag selectors*\n", "You can find them all [here](https://www.w3schools.com/cssref/css_selectors.asp), but shortly these are \".\" for classes, # for ids and plain text for the element name.\n", "\n", "\n", "You can also combine them, so that looking for \".class1.class2\" would select all elements displaying both classes. For a deeper overview please check the above link (or google \"html tag selectors\"). 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 71 }, "id": "PBaiK8OLeMJu", "outputId": "5b75f973-41c1-4ad4-fd9f-4f1f7665ba1d" }, "outputs": [], "source": [ "print(len(listchildren)) # we can see here how many children the html doc has got.\n", "ingredients_unparsed = soup.select_one(\".tasty-recipes-ingredients\")\n", "# let's get all the \"list item\" elements in a list:\n", "ing_unp = ingredients_unparsed.findAll('li')\n", "print(ing_unp)" ] }, { "cell_type": "markdown", "metadata": { "id": "tFXVPZhIeMJw" }, "source": [ "Mmmh... not so pretty yet. How about listing their items using the text method?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "xASBZsnMeMJx", "outputId": "7af0f6e9-3b4f-4f34-e444-794087d06e25" }, "outputs": [], "source": [ "\n", "ingredients = [t.text for t in ing_unp]\n", "print(\"Ingredients:\\n\")\n", "# [print(i) for i in ingredients] # Also prints the generator\n", "# Instead\n", "for ing in ingredients:\n", " print(ing)" ] }, { "cell_type": "markdown", "metadata": { "id": "O-RItVHyeMJz" }, "source": [ "Good. Now the instructions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "id": "d-3Op4B6eMJ0", "outputId": "75a70f0c-86d3-4be9-d2d8-84df91c4f392" }, "outputs": [], "source": [ "instructions_unparsed = soup.select_one(\".tasty-recipes-instructions\")\n", "instructions_unparsed = instructions_unparsed.findAll(\"li\")\n", "print(instructions_unparsed)" ] }, { "cell_type": "markdown", "metadata": { "id": "wPWXuglfeMJ2" }, "source": [ "Let's finish off with the title:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "yg1TnWe2eMJ3", "outputId": "05d39a2e-3779-45f1-ddeb-9c6d2ae5f494" }, "outputs": [], "source": [ "title_unparsed = soup.select_one(\".post-header\") # \n", "categorical_title = title_unparsed.text.split(\"›\") # website specific divider.\n", "recipe_title = categorical_title[-1].strip() # let's remove that ugly space at the beginning.\n", "recipe_title" ] }, { "cell_type": "markdown", "metadata": { "id": "RYb6WtXYeMJ6" }, "source": [ "## Task 2.1\n", "Now it's your turn. Create a function that can scrape any recipe webpage from the same website (other websites will have different class tags). \n", "\n", "Make sure to:\n", "\n", "- return itemized content (e.g. ingredients) in a list. You may want to use a list comprehension here.\n", "- Not all items have been cleaned of their html markdown (see variables ```ingredients``` vs. ```instructions_unparsed```. Make sure to return a list with human readable content (i.e. 
by using the ```.text``` attribute).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "id": "UQu9ecLEeMJ6", "outputId": "a8aa0e14-a8fb-4279-cf32-8dca97ab3412" }, "outputs": [], "source": [ "# -*- coding: utf-8 -*-\n", "\n", "import requests\n", "import json\n", "from bs4 import BeautifulSoup\n", "\n", "# Pass in a URL containing hRecipe, such as\n", "# https://www.jamieoliver.com/recipes/pasta-recipes/veggie-carbonara/\n", "\n", "URL = \"https://www.acouplecooks.com/mushroom-pasta-with-goat-cheese/\"#YOUR RECIPE HERE/\n", "\n", "# Parse out some of the pertinent information for a recipe.\n", "# See http://microformats.org/wiki/hrecipe.\n", "\n", "def parse_website(url):\n", " page = requests.get(url)\n", " soup = BeautifulSoup(page.content, 'html.parser')\n", " \n", " # You code here\n", " # Parse header and get the title\n", " title_unparsed = soup.select_one(\".post-header\") # \n", " categorical_title = title_unparsed.text.split(\"›\") # website specific divider.\n", " recipe_title = categorical_title[-1].strip() # let's remove that ugly space at the beginning.\n", " fn = recipe_title\n", "\n", " # Ingredients\n", " ingredients_unparsed = soup.select_one(\".tasty-recipes-ingredients\")\n", " # let's get all the \"list item\" elements in a list:\n", " ing_unp = ingredients_unparsed.findAll('li')\n", " ingredients = [t.text for t in ing_unp]\n", "\n", " # Instructions\n", " instructions_unparsed = soup.select_one(\".tasty-recipes-instructions\")\n", " instructions_unparsed = instructions_unparsed.findAll(\"li\")\n", " instructions = [t.text for t in instructions_unparsed]\n", "\n", " return {\n", " 'name': fn,\n", " 'ingredients': ingredients,\n", " 'instructions': instructions,\n", " }\n", " \n", "recipe = parse_website(URL)\n", "print (recipe)" ] }, { "cell_type": "markdown", "metadata": { "id": "ccURluAIeMJ8" }, "source": [ "But How can we get information not only from one website, but from all? \n", "\n", "The answer: microformats.\n", "\n", "But rather than extracting with information manually from the schema.org or hRecipe microformats, we can use a package, ```scrape-schema-recipe``` \n", "\n", "Feel free to experiment with it. " ] }, { "cell_type": "markdown", "metadata": { "id": "EBY-y_GreMJ8" }, "source": [ "### Task 2.2\n", "hRecipe is a microformat specifically created for recipes.\n", "Can you for example easily compare different dessert recipe ingredients? For inspiration you can look back at the exercises you did in Hands-on session 1 where you compared different sets of tweets." ] }, { "cell_type": "markdown", "metadata": { "id": "n-J8fiLbeMJ9" }, "source": [ "## Exercise 3" ] }, { "cell_type": "markdown", "metadata": { "id": "7XBeqJHVeMJ9" }, "source": [ "Schema.org is one of the most widely used annotations formats. Schema.org is a multipurpose template that has been created by a consortium consisting of Yahoo!, Google and Microsoft. It can describe entities, events, products etc. Check out the vocabulary specs on Schema.org." ] }, { "cell_type": "markdown", "metadata": { "id": "fiw8JClyeMJ-" }, "source": [ "### Task 3\n", "\n", "Parsing schema.org microdata. 
To parse this data you need to install the rdflib-microdata package, which you have done in one of the previous steps.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 153 }, "id": "X2zr3fOOeMJ-", "outputId": "d123f981-d73f-470f-b5e9-8735819f894b" }, "outputs": [], "source": [ "from rdflib import Graph\n", "\n", "# Source: https://www.youtube.com/watch?v=sCU214rbRZ0\n", "# Pass in a URL containing Schema.org microformats\n", "URL = \"http://dbpedia.org/resource/Micheal_Jackson\"\n", "\n", "# Initialize a graph\n", "g = Graph()\n", "\n", "# Parse in an RDF file graph dbpedia\n", "result = g.parse(location=URL)\n", "\n", "# Loop through first 10 triples in the graph\n", "for index, (sub, pred, obj) in enumerate(g):\n", " print(sub, pred, obj)\n", " if index == 10:\n", " break" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "hrQ2EuY5JAn1", "outputId": "eba60ebb-7ac5-4451-c16e-3f68e66af7f3" }, "outputs": [], "source": [ "# Print the size of the Graph\n", "print(f'Graph has {len(g)} facts')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 323 }, "id": "IAO1JllwJMqO", "outputId": "08f5e32d-d1a6-4a30-878a-ce7b768a8811" }, "outputs": [], "source": [ "# Print out the entire Graph in the RDF Turtle format\n", "print(g.serialize(format='ttl'))" ] }, { "cell_type": "markdown", "metadata": { "id": "dzbynasAeMKA" }, "source": [ "### Task 3.1 \n", "Compare the schema.org information about a band on last.fm to the Facebook Open Graph information about the same band from Facebook. What are the differences? Which format do you think supports better interoperability?" ] }, { "cell_type": "markdown", "metadata": { "id": "Nocs4YDPeMKB" }, "source": [ "### Task 3.2\n", "Explore the various microformats at http://microformats.org/ and compare the output of the exercises with the output of http://microformats.org/. Think about possible microformats you want to support in your final assignment and read up on how to parse them." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "Hands-on_2_microformats.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 1 }
In this session we are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data.
Prerequisites:
1. Python 3.8
2. Python packages: requests, BeautifulSoup4, HTMLParser, rdflib
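These packages are installed with pip; the notebook's first code cell runs the following (activate your virtualenv first, if you use one):

!pip install requests
!pip install BeautifulSoup4
!pip install HTMLParser
!pip install rdflib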
What are microformats?
Microformats are an open standard for structured data: a format for structured blocks of XHTML that carry data. Because they are XHTML, they stay readable for humans; because they are structured, they are easy for machines to process and to exchange with external applications. In essence, microformats are a small set of simple conventions for embedding semantics into HTML in a way that supports decoupled development, and a method for semantically annotating web pages.
An example
As the description above suggests, microformats simply add metadata and extra attributes to existing HTML elements to strengthen their semantics. A link to a site's homepage used to be written like this:
<a href="http://www.bbon.cn">Web Design Blog</a>
Now we add a rel attribute to the <a> element:
<a href="http://www.bbon.cn" rel="homepage">Web Design Blog</a>
The <a> tag above now carries a rel="homepage" attribute, which states that the target of the link is the homepage of the site. By adding a semantic attribute to an existing link element, we give the link concrete structure and meaning.
The hCard microformat
hCard is a microformat for publishing the contact details of people, companies, organizations and places. It can be embedded in HTML, Atom, RSS and other XML-based markup, and it reuses the properties and values of vCard.
Example
Take the following HTML:
<div>
  <div>Joe Doe</div>
  <div>The Example Company</div>
  <div>604-555-1234</div>
  <a href="http://example.com/">http://example.com/</a>
</div>
With the microformat added it becomes:
<div class="vcard"> <div class="fn">Joe Doe</div> <div class="org">The Example Company</div> <div class="tel">604-555-1234</div> <a class="url" href="http://example.com/">http://example.com/</a> </div>
Here the formatted name (class="fn"), organization (class="org"), telephone number (class="tel") and URL (class="url") are each marked with the corresponding class, and the whole block is wrapped in class="vcard".
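As a quick illustration (not part of the original notebook; the variable names are mine), a parser such as BeautifulSoup, which we use in the exercises below, can read these fields by their class names:

from bs4 import BeautifulSoup

html = '''
<div class="vcard">
  <div class="fn">Joe Doe</div>
  <div class="org">The Example Company</div>
  <div class="tel">604-555-1234</div>
  <a class="url" href="http://example.com/">http://example.com/</a>
</div>
'''

# Locate the hCard container, then pick out each property by its class name.
soup = BeautifulSoup(html, 'html.parser')
card = soup.find(class_='vcard')
contact = {
    'fn': card.find(class_='fn').text,
    'org': card.find(class_='org').text,
    'tel': card.find(class_='tel').text,
    'url': card.find(class_='url')['href'],
}
print(contact)
# {'fn': 'Joe Doe', 'org': 'The Example Company', 'tel': '604-555-1234', 'url': 'http://example.com/'}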
Why build microformats into the front end
By marking up content blocks semantically according to an existing, widely adopted standard, microformats enable external applications, aggregators and search engines to:
1. identify the semantics of content blocks more accurately when crawling Web content;
2. operate on the content: provide access to it, validate it, and convert it into other related formats for external programs and Web services to use.
Exercise 1
Even if a web page does not use microformats, interesting data can often be extracted from its HTML. We can use a package such as BeautifulSoup to pull arbitrary pieces of data out of any HTML page. The code below shows how to find the URL of the first image in the infobox table of the Wikipedia page on Amsterdam. Tip: the image URL is in the "src" attribute of the "img" element inside the "table" element with class="infobox".
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

# In the original geo.py script the URL of a page with geotags was passed
# on the command line, e.g. python geo.py 'http://en.wikipedia.org/wiki/Amsterdam'.
# Here it is simply hardcoded.
URL = 'https://en.wikipedia.org/wiki/Amsterdam'

req = requests.get(URL, headers={'User-Agent': "Social Web Course Student"})
soup = BeautifulSoup(req.text, 'html.parser')
# print(req.text)

# The first "img" inside the "table" element with class="infobox".
image1 = soup.findAll('table', class_='infobox')[0].find('img')
print(image1['src'])
The output is the image URL: //upload.wikimedia.org/wikipedia/commons/thumb/b/be/KeizersgrachtReguliersgrachtAmsterdam.jpg/270px-KeizersgrachtReguliersgrachtAmsterdam.jpg
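The notebook's Exercise 1 then reads the coordinates from a hidden <span class="geo">52.367; 4.900</span> element on the same page. Below is a cleaned-up sketch of that cell, reusing the soup object from above (the notebook's own version still contains Python 2-style print statements):

# The coordinates sit in a span with class "geo" that is not shown to the reader.
geoTag = soup.find(True, 'geo')
print(geoTag)

if geoTag and len(geoTag) > 1:
    # Some pages nest separate "latitude" and "longitude" elements inside it.
    lat = geoTag.find(True, 'latitude').string
    lon = geoTag.find(True, 'longitude').string
    print('Location is at', lat, lon)
elif geoTag and len(geoTag) == 1:
    # The Amsterdam page stores both values in a single "lat; lon" string,
    # so we split on the semicolon and strip the whitespace.
    (lat, lon) = geoTag.string.split(';')
    (lat, lon) = (lat.strip(), lon.strip())
    print('Location is at', lat, lon)
else:
    print('Location not found')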