
python spark random forest getting-started demo


class pyspark.mllib.tree.RandomForest

Learning algorithm for a random forest model for classification or regression.

New in version 1.2.0.
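The excerpt below documents the classification entry point, trainClassifier; for the regression side mentioned above there is a parallel classmethod, trainRegressor. As a minimal sketch (not from the original post), assuming an existing SparkContext sc and an illustrative toy dataset; "variance" is the impurity used for regression trees:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Toy regression data: the label grows with the single feature (illustrative only).
toy = sc.parallelize([
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(1.0, [1.0]),
    LabeledPoint(4.0, [2.0]),
    LabeledPoint(9.0, [3.0]),
])
reg_model = RandomForest.trainRegressor(toy, categoricalFeaturesInfo={},
                                        numTrees=3, seed=42)
reg_model.predict([2.5])  # averages the individual trees' predictions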

supportedFeatureSubsetStrategies = ('auto', 'all', 'sqrt', 'log2', 'onethird')

classmethod trainClassifier(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None)


Train a random forest model for binary or multiclass classification.

Parameters:

  • data – Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
  • numClasses – Number of classes for classification.
  • categoricalFeaturesInfo – Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1} (see the sketch after the Returns block).
  • numTrees – Number of trees in the random forest.
  • featureSubsetStrategy – Number of features to consider for splits at each node. Supported values: “auto”, “all”, “sqrt”, “log2”, “onethird”. If “auto” is set, this parameter is chosen based on numTrees: if numTrees == 1, it is set to “all”; if numTrees > 1 (a forest), it is set to “sqrt”. (default: “auto”)
  • impurity – Criterion used for information gain calculation. Supported values: “gini” or “entropy”. (default: “gini”)
  • maxDepth – Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 4)
  • maxBins – Maximum number of bins used for splitting features. (default: 32)
  • seed – Random seed for bootstrapping and choosing feature subsets. Set as None to generate seed based on system time. (default: None)

Returns:

RandomForestModel that can be used for prediction.
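To make categoricalFeaturesInfo and the other parameters concrete before the official doctest below, here is a minimal hedged sketch. trainingData stands for a hypothetical existing RDD of LabeledPoint whose feature 0 takes the categorical values 0, 1 and 2; every name outside the API itself is illustrative:

from pyspark.mllib.tree import RandomForest

model = RandomForest.trainClassifier(
    trainingData,                    # hypothetical RDD[LabeledPoint]
    numClasses=2,
    categoricalFeaturesInfo={0: 3},  # feature 0 is categorical with arity 3
    numTrees=10,
    featureSubsetStrategy='sqrt',    # try sqrt(numFeatures) at each split
    impurity='entropy',
    maxDepth=5,
    maxBins=32,                      # must be >= the largest categorical arity
    seed=42,
)

Note that maxBins must be at least as large as the largest categorical arity, which holds here (32 >= 3).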

Example usage:

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import RandomForest
>>>
>>> data = [
...     LabeledPoint(0.0, [0.0]),
...     LabeledPoint(0.0, [1.0]),
...     LabeledPoint(1.0, [2.0]),
...     LabeledPoint(1.0, [3.0])
... ]
>>> model = RandomForest.trainClassifier(sc.parallelize(data), 2, {}, 3, seed=42)
>>> model.numTrees()
3
>>> model.totalNumNodes()
7
>>> print(model)
TreeEnsembleModel classifier with 3 trees

>>> print(model.toDebugString())
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    Predict: 1.0
  Tree 1:
    If (feature 0 <= 1.0)
     Predict: 0.0
    Else (feature 0 > 1.0)
     Predict: 1.0
  Tree 2:
    If (feature 0 <= 1.0)
     Predict: 0.0
    Else (feature 0 > 1.0)
     Predict: 1.0

>>> model.predict([2.0])
1.0
>>> model.predict([0.0])
0.0
>>> rdd = sc.parallelize([[3.0], [1.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]

New in version 1.2.0.
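Beyond the doctest above, a common end-to-end workflow is to hold out a test set, measure the classification error, and persist the fitted ensemble. A minimal sketch, assuming an existing SparkContext sc; the LibSVM sample file ships with Spark distributions, and both paths are illustrative:

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Load a LibSVM file into an RDD of LabeledPoint and split it 70/30.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
trainingData, testData = data.randomSplit([0.7, 0.3], seed=42)

# An empty categoricalFeaturesInfo treats every feature as continuous.
model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=3, seed=42)

# Predict on the features alone, then zip predictions back with true labels.
predictions = model.predict(testData.map(lambda lp: lp.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = (labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count()
           / float(testData.count()))
print('Test Error = ' + str(testErr))

# Save and reload the fitted ensemble (the target path is illustrative).
model.save(sc, 'target/tmp/myRandomForestClassificationModel')
sameModel = RandomForestModel.load(sc, 'target/tmp/myRandomForestClassificationModel')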

 

Source: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree

From: https://blog.51cto.com/u_11908275/6393834
