标签：city 15 San 445 trip station SQL id select

cmu15-445是一门关于数据库的课程，看到该课程的homeworks和project就觉得十分有挑战性。本文是对该课程的Homework1， SQL语句的十道题，做完以后可以对SQL查询有个好的了解。
Sqlite安装配置省略
Homework的指导页面

课程提供的数据库表关系如下

下面十道题都将对该表进行操作。
三个表，
trip：包含id，来去车站，来去时间，自行车id
station：车站，车站所属的城市，
天气：与城市对应

Q1 查询名字唯一的城市的数量

Count the number of cities. The purpose of this query is to make sure that the formatting of your output matches exactly the formatting of our auto- grading script.
Details: Print the number of cities (eliminating duplicates).

Answer: Here's the correct SQL query and expected output:
x

sqlite> select count(distinct(city)) from station;
5

Q2 （分组聚集，排序，计数 count）

需要查询每个城市的火车站的数量，并且先按照每个城市的火车站的数量排序，再按照城市名排序（都是升序）。此题并不难
Count the number of stations in each city.
Details: Print city name and number of stations. Sort by number of stations (increasing), and break ties by city name (increasing).

sqlite> select city, count(station_id) as cnt
   ...> from station
   ...> group by city
   ...> order by cnt asc, city asc;
Palo Alto|5
Mountain View|7
Redwood City|7
San Jose|16
San Francisco|35

注意group的聚合， group会将所有具有相同属性的元组分到一个组。然后不能对没有出现在group by子句中的属性进行select（或者是聚集以后该属性的值是唯一的也可以）。一般会使用count， avg等对一个分组内的几个元组的该的属性进行聚集操作

Q3 多关系查询

查询每个城市名，以及其所占trip总数的比例（当其作为trip的起点或终点时候，起点终点不重复计数）。按照如下步骤，将sql语句分解
Find the percentage of trips in each city. A trip belongs to a city as long as its start station or end station is in the city. For example, if a trip started from station A in city P and ended in station B in city Q, then the trip belongs to both city P and city Q. If P equals to Q, the trip is only counted once.
Details: Print city name and ratio between the number of trips that belong to that city against the total number of trips (a decimal between 0-1, round to four decimal places using ROUND()). Sort by ratio (decreasing), and break ties by city name (increasing).

首先，将trip和station两个表根据station id联系起来 where判断关联


sqlite> select ID, city, station_id, station_id, end_station_id
   ...> from trip, station
   ...> where station_id = start_station_id or station_id = end_station_id
   ...> limit 10;
4069|San Francisco|64|64|64
4073|San Francisco|66|66|69
4073|San Francisco|69|69|69
4074|San Francisco|66|66|69
4074|San Francisco|69|69|69
4075|San Francisco|66|66|69
4075|San Francisco|69|69|69
4076|San Francisco|66|66|69
4076|San Francisco|69|69|69
4078|Redwood City|22|22|22

这种查询称为多关系查询

select A1, A2...  属性
from r1,r2.. 关系
where P; 谓词

尽管句子必须以select form where 的次序写出，但查询所代表的运算首先是：from，然后是where，最后是select
from句定义所有关系的笛卡尔积，然后由where使用谓词来限制笛卡尔积所创建的元组， select最后指定属性

然后需要根据city进行分组，分组的时候计算这个城市参与的trip的个数，使用count加distinct。

sqlite> select city, count(distinct(ID))
   ...> from trip, station
   ...> where station_id = start_station_id or station_id = end_station_id
   ...> group by city
   ...> limit 10;
Mountain View|18606
Palo Alto|7271
Redwood City|3494
San Francisco|603733
San Jose|37897

计算分母，旅行的总数量

select count(*) as cnt from trip;

经常使用count计算关系中元组的数量

使用as连接起来，将这几个操作结合起来

sqlite> select city_trip_cnt.city, (ROUND(city_trip_cnt.cnt*1.0 / all_trip_cnt.cnt, 4)) as ratio
   ...> from (
   ...>     select city, count(distinct(ID)) as cnt
   ...>     from trip, station
   ...>     where station_id = start_station_id or station_id = end_station_id
   ...>     group by city
   ...> ) as city_trip_cnt, (select count(*) as cnt from trip) as all_trip_cnt
   ...> order by ratio desc, city asc
   ...> ;
San Francisco|0.9011
San Jose|0.0566
Mountain View|0.0278
Palo Alto|0.0109
Redwood City|0.0052

as起了个别名，不然会有重复的属性

这里，from后面的有两部分，用了一个逗号隔开了，每个部分的表起了一个别名，其实不起别名直接用了列也能找到，但是可能会出现不同表里的列冲突的问题。

desc降序， asc升序

Q4 with as句式每组里面最大的属性a对应的属性b

找到每个城市最收欢迎的车站，输出访问次数。
For each city, find the most popular station in that city. "Popular" means that the station has the highest count of visits. As above, either starting a trip or finishing a trip at a station, the trip is counted as one "visit" to that station. The trip is only counted once if the start station and the end station are the same.
Details: For each station, print city name, most popular station name and its visit count. Sort by city name, ascending.
这里有个条件，出发城市和结束城市都要计算，而且如果是同一个城市只计算一次。这样直接使用where拼表即可

sqlite> select city, station_id, station_name, count(distinct(ID)) as cnt  其实这里对ID不必使用distinct
   ...> from trip, station
   ...> where station_id = start_station_id or station_id = end_station_id
   ...> group by station_id
   ...> limit 10;
San Jose|2|San Jose Diridon Caltrain Station|18782
San Jose|3|San Jose Civic Center|2989
San Jose|4|Santa Clara at Almaden|7368
San Jose|5|Adobe on Almaden|2348
San Jose|6|San Pedro Square|5883
San Jose|7|Paseo de San Antonio|4582
San Jose|8|San Salvador at 1st|3239
San Jose|9|Japantown|3880
San Jose|10|San Jose City Hall|3822
San Jose|11|MLK Library|4364

这里使用where连接后，根据笛卡尔积，交叉后筛选出来了符合条件的，然后聚集，之后使用count计数。注意这里city不在group by子句中但是也可以select，因为city对station_id聚集后的分组内是唯一的。

因为要对上面这个表反复使用，使用with as句式命名为visit
这时还有个问题，要找出同一个city下的的最大属性a元组对应的属性b，可以先按city分组后使用max聚集函数求出最大值，然后找出等于这个最大值的元组，select属性b

with visit(station_id, station_name, city, cnt) as (
    select station_id, station_name, city, count(distinct(ID)) as cnt
    from trip, station
    where station_id = start_station_id or station_id = end_station_id
    group by station_id
)
select visit.city, visit.station_name, visit.cnt
from visit
where visit.cnt = (
    select max(max_visit.cnt)
    from visit as max_visit
    where max_visit.city = visit.city
)
order by city
;

Q5 日期，时间戳， union合并两个或者多个select语句的结果 round操作

Find the top 10 days that have the highest average bike utilization. For simplicity, we only consider trips that use bikes with id <= 100. The average bike utilization on date D is calculated as the sum of the durations of all the trips that happened on date D divided by the total number of bikes with id <= 100, which is a constant. If a trip overlaps with date D, but starts before date D or ends after date D, then only the interval that overlaps with date D (from 0:00 to 24:00) will be counted when calculating the average bike utilization of date D. And we only calculate the average bike utilization for the date that has been either a start or an end date of a trip. You can assume that no trip has negative time (i.e., for all trips, start time <= end time).
查找平均自行车利用率最高的前 10 天。为简单起见，我们只考虑使用 id <= 100 的自行车的行程。日期 D 的平均自行车利用率计算为日期 D 发生的所有行程的持续时间之和除以 id <= 的自行车总数= 100，这是一个常数。如果行程与日期 D 重叠，但在日期 D 之前开始或在日期 D 之后结束，则在计算日期 D 的平均自行车利用率时，仅计算与日期 D 重叠的间隔（从 0:00 到 24:00） . 而且我们只计算旅行开始或结束日期的平均自行车利用率。您可以假设没有行程有负时间（即，对于所有行程，开始时间 <= 结束时间）。

Details: For the dates with the top 10 average duration, print the date and the average bike duration on that date (in seconds, round to four decimal places using the ROUND() function). Sort by the average duration, decreasing. Please refer to the updated note before Q1 when calculating the duration of a trip.
详细信息：对于平均持续时间排名前 10 的日期，打印日期和该日期的平均自行车持续时间（以秒为单位，使用 ROUND() 函数四舍五入到小数点后四位）。按平均持续时间排序，递减。计算行程时长请参考Q1之前的更新说明。

Hint: All timestamps are stored as text after loaded from csv in sqlite. You can use datetime(timestamp string) to get the timestamp out of the string and date(timestamp string) to get the date out of the string. You may also find the funtion strftime() helpful in computing the duration between two timestamps.
提示：所有时间戳在从 sqlite 中的 csv 加载后都存储为文本。您可以使用 datetime(timestamp string) 从字符串中获取时间戳，使用 date(timestamp string) 从字符串中获取日期。您可能还会发现函数 strftime() 有助于计算两个时间戳之间的持续时间。

只计算旅行开始和结束这两天的使用率，中间过程不算使用。

with dates as (
	select date(start_time) as tdate 
	from trip 
	union
	select date(end_time) as tdate
	from trip)
select tdate, round(sum(strftime('%s', min(datetime(end_time), datetime(tdate, '+1 day'))) - strftime('%s', max(datetime(start_time), datetime(tdate)))) * 1.0 / (select count(distinct(bike_id)) from trip where bike_id <= 100), 4) as avg_duration 
from trip, dates 
where bike_id <= 100 and datetime(start_time) < datetime(tdate, '+1 day') and datetime(end_time) > datetime(tdate) 
group by tdate 
order by avg_duration desc 
limit 10;

SQL UNION 操作符合并两个或多个 SELECT 语句的结果。得到要进行统计的日子。
round(数字，位数)
通过该问题，了解了日期的函数究竟是怎么一回事。date：年月日，time：时分秒，datetime：年月日+时分秒。

另外，sum，strftime函数的意义，以及with语句中其实也不全是需要像上面那样那种方式对列去别名

以及union对相同数据的合并去重操作，distinct是对同一列去重。 union是两个表的列去重

扩展：SQLite的五个时间函数：

date(timestring, modifier, modifier, …)**：以 YYYY-MM-DD 格式返回日期
time(timestring, modifier, modifier, …)**：以 HH:MM:SS 格式返回时间
datetime(timestring, modifier, modifier, …)**：以 YYYY-MM-DD HH:MM:SS 格式返回日期时间
julianday(format, timestring, modifier, modifier, ..)：返回从格林尼治时间的公元前 4714 年 11 月 24 日正午算起的天数
strftime(format, timestring, modifier, modifier, ..)：根据第一个参数指定的格式字符串返回格式化的日期
讲道理其他四个函数都可以用 strftime() 函数来表示：

Q6 多重条件，对一个关系使用两次

One of the possible data-entry errors is to record a bike as being used in two different trips, at the same time. Thus, we want to spot pairs of overlapping intervals (start time, end time). To keep the output manageable, we ask you to do this check for bikes with id between 100 and 200 (both inclusive). Note: Assume that no trip has negative time, i.e., for all trips, start time <= end time.
一种可能的数据输入错误是记录一辆自行车同时用于两次不同的旅行。因此，我们想要发现成对的重叠间隔（开始时间、结束时间）。为了使输出易于管理，我们要求您使用 id between 100 and 200 (both inclusive) 对自行车进行此检查。注意：假设没有行程有负时间，即对于所有行程，开始时间 <= 结束时间。
Details: For each conflict (a pair of conflict trips), print the bike id, former trip id, former start time, former end time, latter trip id, latter start time, latter end time. Sort by bike id (increasing), break ties with former trip id (increasing) and then latter trip id (increasing).
详细信息：对于每个冲突（一对冲突行程），打印自行车 ID、前行程 ID、前开始时间、前结束时间、后行程 ID、后开始时间、后结束时间。按自行车 ID（增加）排序，与前一个行程 ID（增加）和后一个行程 ID（增加）断开联系。

Hint: (1) Report each conflict pair only once, so that former trip id < latter trip id. (2) We give you the (otherwise tricky) condition for conflicts: start1 < end2 AND end1 > start2
提示：（1）每个冲突对只报告一次，所以 former trip id < latter trip id 。 (2) 我们给你冲突的（否则棘手的）条件：start1 < end2 AND end1 > start2

基本都是按照题目的提示来写的，比较简单。
写成这种形式看起来比较舒服

select ftrip.bike_id, ftrip.id, ftrip.start_time, ftrip.end_time, ltrip.id, ltrip.start_time, ltrip.end_time
from trip as ftrip, trip as ltrip
where ftrip.bike_id between 100 and 200 
    and ftrip.bike_id == ltrip.bike_id
    and ftrip.id < ltrip.id 
    and ftrip.start_time < ltrip.end_time
    and ftrip.end_time > ltrip.start_time
order by ftrip.bike_id asc, ftrip.id asc, ltrip.id asc
;

Q7 having 子句对分组进行限制条件而不是对元组

Find all the bikes that have been to more than one city. A bike has been to a city as long as the start station or end station in one of its trips is in that city.
找出所有去过一个以上城市的自行车。只要其中一次旅行的起点站或终点站在那个城市，一辆自行车就已经去过那个城市。
Details: For each bike that has been to more than one city, print the bike id and the number of cities it has been to. Sort by the number of cities (decreasing), then bike id (increasing).
详细信息：对于去过多个城市的每辆自行车，打印自行车 ID 和去过的城市数量。按城市数量（递减）排序，然后按自行车编号（递增）排序。
先根据来去城市拼表，将一次旅行中去的城市中等于station的元组用where筛选出来，之后自行车id进行group后，一个分组内的不同因为来去车站而包含多个city，再对city使用distinct得到cnt，使用having语句。

select trip.bike_id, count(distinct(station.city)) as cnt
from trip, station
where trip.start_station_id == station.station_id 
    or trip.end_station_id == station.station_id
group by trip.bike_id
having cnt > 1
order by cnt desc, bike_id asc

这里having限制的是每个分组而不是元组。 having次序在聚集之后，select之前。次序from where group having select

Q8

Find what is the average number of trips made per day on each type of weather day. The type of weather on a day is specified by weather.events, such as 'Rain', 'Fog' and so on. For simplicity, we consider all days that does not have a weather event (weather.events = '\N') as a single type of weather. Here a trip belongs to a date only if its start time is on that date. We use the weather at the starting position of that trip as its weather type as well. There are also 'Rain' and 'rain' in weather.events. For simplicity, we consider them as different types of weathers. When counting the total number of days for a weather, we consider a weather happened on a date as long as it happened in at least one region on that date.
求出在每种天气类型下每天的平均出行次数。一天的天气类型由 weather.events 指定，例如 'Rain'、'Fog' 等。为简单起见，我们将所有没有天气事件 ( weather.events = '\N' ) 的日子视为单一类型的天气。此处，只有开始时间在该日期的旅行才属于该日期。我们也使用该行程起始位置的天气作为其天气类型。 weather.events 中也有 'Rain' 和 'rain'。为简单起见，我们将它们视为不同类型的天气。在计算天气的总天数时，我们认为天气发生在某个日期，只要它在该日期至少发生在一个地区即可。
Details: Print the name of the weather and the average number of trips made per day on that type of weather (round to four decimal places using ROUND()). Sort by the average number of trips (decreasing), then weather name (increasing).
详细信息：打印天气名称和在该天气类型下每天的平均出行次数（使用 ROUND() 舍入到 four 小数位）。按平均出行次数（递减）排序，然后是天气名称（递增）。

标签：city,15,San,445,trip,station,SQL,id,select
From： https://www.cnblogs.com/lion-cheng/p/17219037.html

CMU 15-445 数据库系统 Homework SQL查询

课程提供的数据库表关系如下

Q1 查询名字唯一的城市的数量

Q2 （分组聚集，排序，计数 count）

Q3 多关系查询

首先，将trip和station两个表根据station id联系起来 where判断关联

然后需要根据city进行分组，分组的时候计算这个城市参与的trip的个数，使用count加distinct。

计算分母，旅行的总数量

使用as连接起来，将这几个操作结合起来

Q4 with as句式每组里面最大的属性a对应的属性b

Q5 日期，时间戳， union合并两个或者多个select语句的结果 round操作

Q6 多重条件，对一个关系使用两次

Q7 having 子句对分组进行限制条件而不是对元组

Q8

相关文章

赞助商

阅读排行

CMU 15-445 数据库系统 Homework SQL查询

课程提供的数据库表关系如下

Q1 查询名字唯一的城市的数量

Q2 （分组聚集，排序， 计数 count）

Q3 多关系查询

首先，将trip和station两个表根据station id联系起来 where判断关联

然后需要根据city进行分组，分组的时候计算这个城市参与的trip的个数，使用count加distinct。

计算分母，旅行的总数量

使用as连接起来，将这几个操作结合起来

Q4 with as句式 每组里面最大的属性a对应的属性b

Q5 日期，时间戳， union合并两个或者多个select语句的结果 round操作

Q6 多重条件，对一个关系使用两次

Q7 having 子句对分组进行限制条件而不是对元组

Q8

相关文章

赞助商

阅读排行

Q2 （分组聚集，排序，计数 count）

Q4 with as句式每组里面最大的属性a对应的属性b