
CORA Dataloader Analysis

Date: 2023-06-20 18:44:06
Tags: analysis, node, CORA, graph, Dataloader, feature, dataset, names, subject

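# Imports the code below relies on (pd, StellarGraph, StellarDiGraph).
# The relative paths assume this module lives in stellargraph/datasets/.
import pandas as pd

from ..core.graph import StellarGraph, StellarDiGraph
from .dataset_loader import DatasetLoader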

class Cora(
    DatasetLoader,
    name="Cora",
    directory_name="cora",
    url="https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz",
    url_archive_format="gztar",
    expected_files=["cora.cites", "cora.content"],
    description="The Cora dataset consists of 2708 scientific publications classified into one of seven classes. "
    "The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector "
    "indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.",
    source="https://linqs.soe.ucsc.edu/data",
):

    _NUM_FEATURES = 1433

    def load(
        self,
        directed=False,
        largest_connected_component_only=False,
        subject_as_feature=False,
        edge_weights=None,
        str_node_ids=False,
    ):
        """
        Load this dataset into a homogeneous graph that is directed or undirected, downloading it if
        required.

        The node feature vectors are included, and the edges are treated as directed or undirected
        depending on the ``directed`` parameter.

        Args:
            directed (bool): if True, return a directed graph, otherwise return an undirected one.
            largest_connected_component_only (bool): if True, returns only the largest connected
                component, not the whole graph.
            edge_weights (callable, optional): a function that accepts three parameters: an
                unweighted StellarGraph containing node features, a Pandas Series of the node
                labels, a Pandas DataFrame of the edges (with `source` and `target` columns). It
                should return a sequence of numbers (e.g. a 1D NumPy array) of edge weights for each
                edge in the DataFrame.
            str_node_ids (bool): if True, load the node IDs as strings, rather than integers.
            subject_as_feature (bool): if True, the subject for each paper (node) is included in the
                node features, one-hot encoded (the subjects are still also returned as a Series).

        Returns:
            A tuple where the first element is the :class:`.StellarGraph` object (or
            :class:`.StellarDiGraph`, if ``directed == True``) with the nodes, node feature vectors
            and edges, and the second element is a pandas Series of the node subject class labels.
        """
        nodes_dtype = str if str_node_ids else int

        return _load_cora_or_citeseer(
            self,
            directed,
            largest_connected_component_only,
            subject_as_feature,
            edge_weights,
            nodes_dtype,
        )

def _load_cora_or_citeseer(
    dataset,
    directed,
    largest_connected_component_only,
    subject_as_feature,
    edge_weights,
    nodes_dtype,
):
    assert isinstance(dataset, (Cora, CiteSeer))

    if nodes_dtype is None:
        nodes_dtype = dataset._NODES_DTYPE

    dataset.download()

    # expected_files should be in this order
    cites, content = [dataset._resolve_path(name) for name in dataset.expected_files]

    feature_names = ["w_{}".format(ii) for ii in range(dataset._NUM_FEATURES)]
    subject = "subject"
    if subject_as_feature:
        feature_names.append(subject)
        column_names = feature_names
    else:
        column_names = feature_names + [subject]

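    # Each row of the content file is: paper ID, the word flags, subject. That is one more
    # field than there are column names, so pandas uses the first field (the paper ID) as
    # the index.
    node_data = pd.read_csv(
        content, sep="\t", header=None, names=column_names, dtype={0: nodes_dtype}
    )

    # Each row of the cites file is: ID of the cited paper, then ID of the citing paper,
    # hence the ("target", "source") column order.
    edgelist = pd.read_csv(
        cites, sep="\t", header=None, names=["target", "source"], dtype=nodes_dtype
    )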

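    # Drop edges whose endpoints are not in the node table: get_indexer returns -1 for
    # IDs missing from the index.
    valid_source = node_data.index.get_indexer(edgelist.source) >= 0
    valid_target = node_data.index.get_indexer(edgelist.target) >= 0
    # Copy the filtered frame so that adding a "weight" column later does not write to a slice.
    edgelist = edgelist[valid_source & valid_target].copy()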

    subjects = node_data[subject]

    cls = StellarDiGraph if directed else StellarGraph

    features = node_data[feature_names]
    if subject_as_feature:
        # one-hot encode the subjects
        features = pd.get_dummies(features, columns=[subject])

    graph = cls({"paper": features}, {"cites": edgelist})

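    if edge_weights is not None:
        # A weighted graph means computing a second StellarGraph after using the unweighted one to
        # compute the weights.
        edgelist["weight"] = edge_weights(graph, subjects, edgelist)
        # Reuse `features` (not node_data[feature_names]) so the one-hot encoded subject is
        # kept when subject_as_feature=True; the two are identical otherwise.
        graph = cls({"paper": features}, {"cites": edgelist})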

    if largest_connected_component_only:
        cc_ids = next(graph.connected_components())
        return graph.subgraph(cc_ids), subjects[cc_ids]

    return graph, subjects
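
The pandas steps in the loader above can be tried in isolation. The following is a minimal sketch, not the StellarGraph implementation: the two in-memory strings are hypothetical miniatures of cora.content and cora.cites (3 word flags instead of 1433), the function name load_tiny is made up for illustration, and the dtype handling and graph construction are left out.

```python
import io

import pandas as pd

# Hypothetical miniature of cora.content: paper ID, 3 word flags, subject.
CONTENT = (
    "11\t0\t1\t0\tTheory\n"
    "22\t1\t0\t0\tNeural_Networks\n"
    "33\t0\t0\t1\tTheory\n"
)
# Hypothetical miniature of cora.cites: cited paper ID, then citing paper ID.
# Paper 99 does not exist in CONTENT, so the last edge should be dropped.
CITES = "11\t22\n22\t33\n11\t99\n"


def load_tiny(subject_as_feature=False):
    """Illustrative helper (not part of StellarGraph) mirroring the parsing steps."""
    feature_names = ["w_{}".format(i) for i in range(3)]
    column_names = feature_names + ["subject"]

    # Rows have one more field than there are names, so the first field
    # (the paper ID) becomes the index.
    node_data = pd.read_csv(
        io.StringIO(CONTENT), sep="\t", header=None, names=column_names
    )
    edgelist = pd.read_csv(
        io.StringIO(CITES), sep="\t", header=None, names=["target", "source"]
    )

    # get_indexer returns -1 for IDs that are missing from the node index.
    valid_source = node_data.index.get_indexer(edgelist.source) >= 0
    valid_target = node_data.index.get_indexer(edgelist.target) >= 0
    edgelist = edgelist[valid_source & valid_target].copy()

    features = node_data[feature_names]
    if subject_as_feature:
        # One-hot encode the subject and keep it next to the word flags.
        features = pd.get_dummies(node_data, columns=["subject"])

    return features, node_data["subject"], edgelist


features, subjects, edges = load_tiny()
print(features)
print(edges)
```

Calling load_tiny(subject_as_feature=True) replaces the subject column with one indicator column per subject, which is what the subject_as_feature option of the loader is described as doing.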

From: https://www.cnblogs.com/ZZXJJ/p/17494427.html
