
554.488/688 Applied Mathematical Computing

Posted: 2023-05-12 09:03:32


554.488/688 Computing for Applied Mathematics
Spring 2023 - Final Project Assignment
The aim of this assignment is to give you a chance to exercise your skills at prediction using
Python. You have been sent an email with a link to data collected on a random sample from some
population of Wikipedia pages, to develop prediction models for three different web page attributes.
Each student is provided with their own data drawn from a Wikipedia page population unique
to that student, and this comes in the form of two files:
• A training set, which is a pickled pandas data frame with 200,000 rows and 44 columns. Each
  row corresponds to a distinct Wikipedia page/url drawn at random from a certain population
  of Wikipedia pages. The columns are
  – URLID in column 0, which gives a unique identifier for each url. You will not be able to
    determine the url from the URLID or the rest of the data. (It would be a waste of time
    to try, so the only information you have about this url is provided in the dataset itself.)
  – 40 feature/predictor variable columns in columns 1,...,40, each associated with a particular
    word (the word is in the header). For each url/Wikipedia page, the word column gives
    the number of times that word appears in the associated page.
  – Three response variables in columns 41, 42 and 43
    * length = the length of the page, defined as the total number of characters in the
      page
    * date = the last date when the page was edited
    * word_present = a binary variable indicating whether at least one of 5 possible words
      (using a word list of 5 words specific to each student and not among the 40 feature
      words) [1] appears in the page
• A test set, which is also a pickled pandas data frame with 50,000 rows but with 41 columns,
  since the response variables (length, date, word_present) are not available to you. The rows
  of the test dataset also correspond to distinct url/pages drawn from the same Wikipedia
  url/page population as the training dataset (with no pages in common with the training set
  pages). The response variables have been removed so that the columns that are available are
  – URLID in column 0
  – the same 40 feature/predictor variable columns corresponding to word counts for the
    same 40 words as in the training set
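A minimal sketch of loading the two files and checking them against the layout described above. The file names `train.pkl` and `test.pkl` and the feature names `word0`, ..., `word39` are placeholders, not names from the assignment; the small frames built at the top only stand in for the real data so that the snippet runs on its own.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in data so the sketch is runnable; the real files have 200,000 and 50,000 rows.
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(40)]          # hypothetical feature names
toy = pd.DataFrame(rng.poisson(1.0, size=(20, 40)), columns=words)
toy.insert(0, "URLID", np.arange(20))
toy["length"] = rng.integers(500, 50_000, size=20)
toy["date"] = pd.Timestamp("2023-01-15")
toy["word_present"] = rng.integers(0, 2, size=20)

tmp = tempfile.mkdtemp()
train_path = os.path.join(tmp, "train.pkl")      # placeholder file names
test_path = os.path.join(tmp, "test.pkl")
toy.iloc[:15].to_pickle(train_path)
toy.iloc[15:].drop(columns=["length", "date", "word_present"]).to_pickle(test_path)

# The actual loading step: pandas reads a pickled data frame with read_pickle.
train = pd.read_pickle(train_path)
test = pd.read_pickle(test_path)

feature_cols = list(train.columns[1:41])         # columns 1..40 hold the word counts
response_cols = list(train.columns[41:44])       # columns 41..43 hold the responses

print(train.shape, test.shape)
print(train[feature_cols].isna().mean().head())  # share of missing values per feature
```

Selecting columns by position mirrors the column numbering in the description; unpickling should only be done on files from a trusted source, since pickle can execute code on load.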
[1] What this list of 5 words is will not be revealed to you, and it would be a waste of time trying to figure out what it is.

Your goal is to use the training data to
• predict the length variable for pages in the test dataset
• predict the mean absolute error you expect to achieve in your predictions of length in the test
  dataset

• predict word_present for pages in the test dataset, attempting to make the false positive rate as
  close as you can to .05 [2], and make the true positive rate as high as you possibly can [3]
• predict your true positive rate for word_present in the test dataset
• predict edited_2023 for pages in the test dataset, attempting to make the false positive rate as
  close as you can to .05 [4], and make the true positive rate as high as you possibly can [5]
• predict your true positive rate for edited_2023 in the test dataset
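One way to aim for a false positive rate near .05 is to have a classifier output a score and then pick the cutoff from the scores of known negatives. The sketch below illustrates that idea on synthetic scores; the helper names `threshold_for_fpr` and `rates` are hypothetical, and the score distribution is made up for illustration.

```python
import numpy as np


def threshold_for_fpr(scores, labels, target_fpr=0.05):
    """Return a cutoff such that roughly target_fpr of the true negatives score above it."""
    negatives = scores[labels == 0]
    # The (1 - target_fpr) quantile of negative scores leaves about target_fpr above it.
    return np.quantile(negatives, 1.0 - target_fpr)


def rates(pred, labels):
    """False positive rate and true positive rate of 0/1 predictions."""
    fpr = np.mean(pred[labels == 0] == 1)
    tpr = np.mean(pred[labels == 1] == 1)
    return fpr, tpr


# Synthetic validation scores: positives tend to score higher than negatives.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=10_000)
scores = rng.normal(loc=labels * 1.5, scale=1.0)

cut = threshold_for_fpr(scores, labels)
pred = (scores > cut).astype(int)
fpr, tpr = rates(pred, labels)
print(f"threshold={cut:.3f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Choosing the cutoff on rows that were not used to fit the model gives a more honest idea of the false positive rate to expect on the test set than choosing it on the fitting rows.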
Since I have the response variable values (length, word_present, date) for the pages in your test
dataset, I can determine the performance of your predictions. Since you do not have those variables,
you will need to set aside some data in your training set or use cross-validation to estimate the
performance of your prediction models.
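As a rough illustration of setting data aside, the sketch below holds out 20% of a synthetic training set, fits a plain least-squares model on the rest, and uses the held-out rows to estimate the mean absolute error. The data, the 5% missing rate, the median imputation, and the linear model are all assumptions chosen for brevity, not the assignment's prescribed method.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the training set: 40 word counts with missing values, plus length.
n, p = 2_000, 40
X = rng.poisson(2.0, size=(n, p)).astype(float)
length = 1_000 + X @ rng.uniform(50, 150, size=p) + rng.normal(0, 300, size=n)
X[rng.random(X.shape) < 0.05] = np.nan           # missing rate is an assumption

# Hold out 20% of the rows; they play the role of the unseen test set.
perm = rng.permutation(n)
val_idx, fit_idx = perm[: n // 5], perm[n // 5:]

# Learn the imputation values on the fitting rows only, then reuse them on the held-out rows.
medians = np.nanmedian(X[fit_idx], axis=0)
X_filled = np.where(np.isnan(X), medians, X)

# Ordinary least squares with an intercept, fitted on the fitting rows.
A = np.column_stack([np.ones(n), X_filled])
coef, *_ = np.linalg.lstsq(A[fit_idx], length[fit_idx], rcond=None)

pred_val = A[val_idx] @ coef
mae_estimate = np.mean(np.abs(pred_val - length[val_idx]))
print(f"estimated test MAE: {mae_estimate:.1f}")
```

Computing the medians from the fitting rows alone keeps information from the held-out rows out of the model, which is what makes the held-out error a usable estimate.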
There are 3 different parts of this assignment, each requiring a submission:
• Part 1 (30 points) - a Jupyter notebook containing
  – a description (in words, no code) of the steps you followed to arrive at your predictions
    and your estimates of prediction quality - including a description of any separation of
    your training data into training and testing data, the method you used for imputation, and
    the methods you tried to use for making predictions (e.g. regression, logistic regression, ...)
    followed by
  – the code you used in your calculations
• Part 2 (60 points) - a csv file with your predictions (a notebook is provided to you for checking
  that your csv file is properly formatted). This file should consist of exactly 4 columns with
  – a header row with URLID, length, word_present, edited_2023
  – 50,000 additional rows
  – every URLID in your test dataset appearing in the URLID column - not altered in any way!
  – no missing values
  – data type for the length column should be integer or float
  – data type for the word_present column should be either integer (0 or 1), float (0. or 1.)
    or Boolean (False/True)
  – data type for the edited_2023 column should be either integer (0 or 1), float (0. or 1.)
    or Boolean (False/True)

[2] false positive rate = among pages for which word_present is 0, the proportion predicted to be 1
[3] true positive rate = among pages for which word_present is 1, the proportion predicted to be 1
[4] false positive rate = among pages for which edited_2023 is 0, the proportion predicted to be 1
[5] true positive rate = among pages for which edited_2023 is 1, the proportion predicted to be 1

• Part 3 (30 points) - providing estimates of the following in a form:
  – what do you predict the mean absolute error of your length predictions to be?
  – what do you predict the true positive rate for your word_present predictions to be?
  – what do you predict the true positive rate for your edited_2023 predictions to be?
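The Part 2 file requirements can be checked mechanically before submitting. The sketch below writes a small stand-in submission and reads it back; it is not the checking notebook mentioned in the assignment, the predictions are random placeholders, and the underscored column names `word_present` and `edited_2023` are assumed spellings of the header.

```python
import io

import numpy as np
import pandas as pd

# Stand-in test IDs and predictions; in practice these come from the test frame and fitted models.
rng = np.random.default_rng(3)
test_ids = np.arange(1000, 1010)
submission = pd.DataFrame({
    "URLID": test_ids,
    "length": rng.uniform(500, 50_000, size=len(test_ids)),
    "word_present": rng.integers(0, 2, size=len(test_ids)),
    "edited_2023": rng.integers(0, 2, size=len(test_ids)),
})

buffer = io.StringIO()                     # use a file path such as "predictions.csv" in practice
submission.to_csv(buffer, index=False)     # index=False keeps the file at exactly 4 columns

# Read the file back and check it against the listed requirements.
buffer.seek(0)
check = pd.read_csv(buffer)
assert list(check.columns) == ["URLID", "length", "word_present", "edited_2023"]
assert len(check) == len(test_ids)         # 50,000 for the real test set
assert set(check["URLID"]) == set(test_ids) and check["URLID"].is_unique
assert not check.isna().any().any()
assert pd.api.types.is_numeric_dtype(check["length"])
for col in ["word_present", "edited_2023"]:
    assert check[col].isin([0, 1]).all()
print("format checks passed")
```

Leaving out `index=False` would add an unnamed fifth column holding the row index, which would break the 4-column requirement.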
Your score in this assignment will be based on
• Part 1 (30 points)
  – evidence of how much effort you put into the assignment (how many different methods
    did you try?)
  – how well did you document what you did?
  – was your method for predicting the quality of your performance prone to over-fitting?
• Part 2 (60 points)
  – how good are your predictions of length, word_present, edited_2023 - I will do predictions
    using your training data and I will compare
    * your length mean absolute deviation to what I obtained in my predictions
    * your true positive rate to what I obtained for the binary variables (assuming you
      managed to appropriately control the false positive rate)
  – how well did you meet specifications - did you get your false positive rate in predictions
    of the binary variables close to .05 (again, compared to how well I was able to do this)
• Part 3 (30 points)
  – how good is your prediction of the length mean absolute deviation
  – how good is your prediction of the true positive rate for the word_present variable
  – how good is your prediction of the true positive rate for the edited_2023 variable
How the datasets were produced
This is information that will not be of much help to you in completing the assignment, except
maybe to convince you that there would be no point in using one of the other students’ data in
completing this assignment.
• I web crawled in Wikipedia to arrive at a random sample of around 2,000,000 pages.
• I made a list of 100 random words and extracted the length, the word counts, and the last
  date edited for each page.
To create one of the student personal datasets, I repeated the following steps for each student:
Repeat
    Chose 10 random words w0,w1,...,w9 out of the 100 words in the list above
    Determined the subsample of pages having w0 and w1 but not w2, w3 or w4
    Used the words w5,w6,w7,w8 and w9 to create the word_present variable
Until
    the subsample has at least 250,000 pages
Randomly sampled 40 of the 90 unsampled words without replacement
Randomly sampled without replacement 250,000 pages out of the subsample
Retained only the 250,000 pages and
    word counts for the 40 words
    length
    word_present
    last date edited
Randomly assigned missing values in the feature (word count) data
Randomly separated the 250,000 pages into
    200,000 training pages
    50,000 test pages
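The steps above can be replayed on a toy scale to see how the pieces fit together. The sketch below is an illustration under stated assumptions, not the instructor's code: the page counts are scaled down (20,000 pages, 500 kept, split 400/100), word counts are simulated as Poisson draws, and the 5% missing rate is made up because the actual rate is not given.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy crawl: 20,000 pages x 100 words (the real one had about 2,000,000 pages).
n_pages, n_words = 20_000, 100
counts = rng.poisson(0.7, size=(n_pages, n_words))
n_keep = 500                                   # toy analogue of 250,000

while True:
    w = rng.choice(n_words, size=10, replace=False)      # w0,...,w9
    has = counts[:, w] > 0                               # does the page contain word w_j?
    in_sub = has[:, 0] & has[:, 1] & ~has[:, 2:5].any(axis=1)
    if in_sub.sum() >= n_keep:                           # repeat until the subsample is big enough
        break

word_present = has[:, 5:10].any(axis=1).astype(int)      # at least one of w5..w9 appears
unsampled = np.setdiff1d(np.arange(n_words), w)          # the other 90 words
feature_words = rng.choice(unsampled, size=40, replace=False)
pages = rng.choice(np.flatnonzero(in_sub), size=n_keep, replace=False)

X = counts[np.ix_(pages, feature_words)].astype(float)   # word counts for the 40 feature words
y_present = word_present[pages]
X[rng.random(X.shape) < 0.05] = np.nan                   # missing rate is an assumption

perm = rng.permutation(n_keep)
train_rows, test_rows = perm[:400], perm[400:]           # 80/20, like 200,000/50,000
print(X[train_rows].shape, X[test_rows].shape, y_present.mean())
```

Because the feature words are drawn only from the 90 unsampled words, none of the words that define the population or word_present can show up among the 40 feature columns.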

 

From: https://www.cnblogs.com/wolfjava/p/17392758.html
