MIT6.830-Lab3

标签：join int t2 t1 Lab3 cost MIT6.830 连接

概览

IntHistogram作用
1. 估计条件过滤某表的某个Int字段时，结果集记录数和该表记录总数的比例；
StringHistogram类似IntHistogram
TableStats作用
1. 存放某表各个字段对应的直方图对象（IntHistogram或StringHistogram类型）；
2. 使用静态对象ConcurrentMap<String, TableStats> statsMap存放数据库中各表的TableStats对象；
3. 计算按某个字段的值过滤后，结果集记录数和该表记录总数的比例；
LogicalPlan作用
1. 分别将查询语句的各部分包装成各类对象，以便后续处理；
2. 将查询语句进行优化，使得消耗资源最低；
JoinOptimizer作用
1. 将连接查询进行优化，返回最佳的连接顺序；
Parser作用
1. 用于加载数据库表目录文件catalog.txt到内存中；
2. 使用tableStatsMap成员变量生成数据库内各个表的字段的直方图（IntHistogram，StringHistogram，用来估计查询或连接操作的消耗）；
3. 用于解析输入的查询语句，将其包装为LogicalPlan类对象(LogicalPlan类中的physicalPlan()方法可以返回优化后结果集的OpIterator类对象)；
4. 打印查询结果到控制台；

实验部分

实验1

为了简化计算，扫描损失和连接损失的公式如下（ntups代表表中记录数）：

scancost(t1)=ntups(t1) x SCALING_FACTOR.
joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2) //IO cost
                       + ntups(t1) x ntups(t2)  //CPU cost

此外，文档中建议使用直方图估计法简化含有过滤条件的损失计算，以对某个Int字段过滤为例，先获取表中该字段的最大值max和最小值min，然后将[min,max]这个区域划分为n个桶，之后再遍历一遍表，依照每条表记录中该字段的值，更新对应位置桶的大小，最终就获得了估算过滤损失的直方图。

IntHistogram

public double estimateSelectivity(Predicate.Op op, int v) {
        int idx = (int) ((v-min)/bucketWidth);
        switch (op){
            case EQUALS:
                if(v<min || v>max)
                    return 0.0;
                else {
                    double eqTuples = buckets[idx]*1.0/bucketWidth;
                    return eqTuples/totalTuples;
                }
            case GREATER_THAN:
                if(v<min)
                    return 1.0;
                else if(v>max)
                    return 0.0;
                else {
                    int right = (int) (min+(idx+1)*bucketWidth-1);
                    double sumGtTuples = (right-v)*1.0/bucketWidth*buckets[idx];
                    for(int j=idx+1;j<buckets.length;j++)
                        sumGtTuples+=buckets[j];
                    return sumGtTuples/totalTuples;
                }
            case LESS_THAN:
                if(v<min)
                    return 0.0;
                else if(v>max)
                    return 1.0;
                else {
                    int left = (int) (min+idx*bucketWidth);
                    double sumLtTuples = (v-left)*1.0/bucketWidth*buckets[idx];
                    for(int j=0;j<idx;j++)
                        sumLtTuples+=buckets[j];
                    return sumLtTuples/totalTuples;
                }
            case NOT_EQUALS:
                return 1.0 - estimateSelectivity(Predicate.Op.EQUALS,v);
            case GREATER_THAN_OR_EQ:
                return estimateSelectivity(Predicate.Op.GREATER_THAN,v-1);
            case LESS_THAN_OR_EQ:
                return estimateSelectivity(Predicate.Op.LESS_THAN,v+1);
            default:
                throw new RuntimeException("illegal operator");
        }
    }

StringHistogram
它将每个String转为int值，借助内部的IntHistogram成员变量实现估算直方图。

实验2

TableStats
使用Map<Integer,IntHistogram> intHistogramMap和Map<Integer,StringHistogram> strHistogramMap来存储表中字段对应的直方图对象（key是字段序号），核心方法是构造方法，大致思路是：先获取每个整形字段的最大最小值，接着根据这些值初始化各整形字段对应的IntHistogram对象，最后再次遍历表，设置每个IntHistogram对象和StringHistogram对象的桶大小。

    public TableStats(int tableid, int ioCostPerPage) {
        this.ioCostPerPage = ioCostPerPage;
        DbFile databaseFile = Database.getCatalog().getDatabaseFile(tableid);
        TupleDesc tupleDesc = databaseFile.getTupleDesc();
        HashMap<Integer, Integer> max = new HashMap<>();
        HashMap<Integer, Integer> min = new HashMap<>();
        this.intHistogramMap = new HashMap<>();
        this.strHistogramMap = new HashMap<>();
        // 初始化整型字段的值
        for(int i=0;i<tupleDesc.numFields();i++){
            if(tupleDesc.getFieldType(i) == Type.INT_TYPE){
                max.put(i,Integer.MIN_VALUE);
                min.put(i,Integer.MAX_VALUE);
            }
        }
        // 获取各整型字段的最大最小值
        int tupleNum = 0;
        DbFileIterator iterator = databaseFile.iterator(tid);
        try {
            iterator.open();
            while (iterator.hasNext()){
                tupleNum++;
                Tuple next = iterator.next();
                for(int i=0;i<tupleDesc.numFields();i++){
                    if(tupleDesc.getFieldType(i) == Type.INT_TYPE){
                        IntField intField = (IntField) next.getField(i);
                        if(intField.getValue() > max.get(i)){
                            max.put(i,intField.getValue());
                        }
                        if(intField.getValue() < min.get(i)){
                            min.put(i,intField.getValue());
                        }
                    }else if(tupleDesc.getFieldType(i) == Type.STRING_TYPE){
                        StringHistogram strHistogram = new StringHistogram(NUM_HIST_BINS);
                        this.strHistogramMap.put(i,strHistogram);
                    }
                }
            }
	// 初始化IntHistogram
            for(Integer integer: max.keySet()){
                IntHistogram intHistogram = new IntHistogram(NUM_HIST_BINS, min.get(integer), max.get(integer));
                this.intHistogramMap.put(integer,intHistogram);
            }
            this.tableTuples = tupleNum;
            iterator.rewind();
	// 设置桶大小
            while (iterator.hasNext()){
                Tuple next = iterator.next();
                for(int i=0;i<tupleDesc.numFields();i++){
                    if(tupleDesc.getFieldType(i) == Type.INT_TYPE){
                        IntField intField = (IntField) next.getField(i);
                        this.intHistogramMap.get(i).addValue(intField.getValue());
                    }else if(tupleDesc.getFieldType(i) == Type.STRING_TYPE){
                        StringField strField = (StringField) next.getField(i);
                        this.strHistogramMap.get(i).addValue(strField.getValue());
                    }
                }
            }
            if(databaseFile instanceof HeapFile)
                this.tablePages = ((HeapFile)databaseFile).numPages();
            else if(databaseFile instanceof BTreeFile)
                this.tablePages = ((BTreeFile)databaseFile).numPages();
            else{
                this.tablePages = 0;
                throw new RuntimeException("can not confirm the page");
            }


        } catch (DbException e) {
            throw new RuntimeException(e);
        } catch (TransactionAbortedException e) {
            throw new RuntimeException(e);
        }

    }

实验3

文档指出计算joincost((t1 join t2) join t3)这种多连接查询的花费时，首先需要估算(t1 join t2)操作返回的结果集数量，但这一操作较为复杂，本实验中只要求我们做个大概估算，估算时需要遵循以下三点：

对于等值连接t1 join t2 on t1.i=t2.j，当i是t1中的主键时，估算的结果不能大于t2中j列非空的记录条数；
对于不存在主键的等值连接，可以直接将较大表的记录条数作为估算结果；
对于范围连接，可以直接将两表记录数的乘积*30%作为估算结果；

节点类

LogicalJoinNode：存放两表连接信息，如t1 join t2 on t1.i=t2.j
LogicalSubplanJoinNode：继承了LogicalJoinNode，存放多表连接信息，如(t1 join t2) join t3
LogicalScanNode：存放查询的表信息，如select * from t中的表t
LogicalFilterNode：存放where子句的过滤信息，如where t1.i = 1
LogicalSelectListNode：存放查询的字段信息，如select max(c) form t1中的max(c)

 public double estimateJoinCost(LogicalJoinNode j, int card1, int card2,
            double cost1, double cost2) {
        if (j instanceof LogicalSubplanJoinNode) {
            // lab3对子查询不要求，使用默认实现
            return card1 + cost1 + cost2;
        } else {
            double ioCost = cost1 + card1 * cost2;
            double cpuCost = card1 * card2;
            return ioCost + cpuCost;
        }
    }

public static int estimateTableJoinCardinality(Predicate.Op joinOp,
                                                   String table1Alias, String table2Alias, String field1PureName,
                                                   String field2PureName, int card1, int card2, boolean t1pkey,
                                                   boolean t2pkey, Map<String, TableStats> stats,
                                                   Map<String, Integer> tableAliasToId) {
        switch (joinOp){
            case EQUALS:
                if(t1pkey || t2pkey)
                    return Math.min(card1,card2);
                else
                    return Math.max(card1,card2);
            default:
                return (int) (card1*card2*0.3);
        }
    }

实验4

Selinger算法获取最佳连接顺序：

1. j = set of join nodes	//j是一组连接节点
2. for (i in 1...|j|):	//从1遍历到j数组的长度
3.     for s in {all length i subsets of j} //生成所有长度为i的j数组的子集，遍历这些子集
4.       bestPlan = {}	// 用来存储该子集的最佳连接顺序
5.       for s' in {all join nodes of s} //遍历该子集中的所有连接节点
6.            subplan = optjoin(s-s') //从缓存取出去除s'节点后的最优连接 subPlan
7.            plan = best way to join s' to subplan //计算将节点s'加入subPlan的最优连接 plan
8.            if (cost(plan) < cost(bestPlan)) //如果plan的消耗小于bestPlan
9.               bestPlan = plan //将plan赋值给bestPlan
10.      optjoin(s) = bestPlan //把连接节点集合s的最优连接，放入缓存
11. return optjoin(j)

代码实现为：

public List<LogicalJoinNode> orderJoins(
            Map<String, TableStats> stats,
            Map<String, Double> filterSelectivities, boolean explain)
            throws ParsingException {
        PlanCache planCache = new PlanCache();
        for(int i=1;i<=joins.size();i++){
            Set<Set<LogicalJoinNode>> subsets = enumerateSubsets(joins, i);
            for(Set<LogicalJoinNode> subset:subsets){
                CostCard bestCostCard = new CostCard();
                bestCostCard.cost = Double.MAX_VALUE;
                for(LogicalJoinNode node:subset){
                    CostCard costCard = computeCostAndCardOfSubplan(stats, filterSelectivities, node, subset, bestCostCard.cost, planCache);
                    if(costCard == null)
                        continue;
                    else if(costCard.cost < bestCostCard.cost){
                        bestCostCard = costCard;
                    }
                }
                if(bestCostCard.cost != Double.MAX_VALUE)
                    planCache.addPlan(subset,bestCostCard.cost,bestCostCard.card,bestCostCard.plan);
            }
        }
        if (explain)
            printJoins(joins,planCache,stats,filterSelectivities);
        return planCache.getOrder(new HashSet<>(joins));
    }

注意computeCostAndCardOfSubplan()方法实现了伪代码的6，7两行，重要参数含义分别是：

stats：存放所有表的TableStats集合，key值是表名，用来计算表的cost
filterSelectivities：存放所有表的过滤度（即过滤操作后结果集占总集合的比例）集合，key值是表别名，是由LogicalPlan解析查询语句的过程中取得的，用来计算表的card
pc：存放所有长度连接数组对应的最优连接集合的缓存，以及该最优连接集合的cost和card

该方法在内部会计算交换连接顺序后的花费（比如 t1 join t2换成t2 join t1），以选出最优的连接方式

CostCard computeCostAndCardOfSubplan(
            Map<String, TableStats> stats,
            Map<String, Double> filterSelectivities,
            LogicalJoinNode joinToRemove, Set<LogicalJoinNode> joinSet,
            double bestCostSoFar, PlanCache pc) throws ParsingException

CostCard类存储一个join连接的数组，以及该数组进行连接所产生的cost和card

标签：join,int,t2,t1,Lab3,cost,MIT6.830,连接
From： https://www.cnblogs.com/rockdow/p/18045432

概览

实验部分

实验1

实验2

实验3

节点类

实验4

相关文章

赞助商

阅读排行