
python spark random forest getting-started demo


class pyspark.mllib.tree.RandomForest

Learning algorithm for a random forest model for classification or regression.

New in version 1.2.0.
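The excerpt below documents the classification entry point, trainClassifier; for the regression side mentioned above there is a parallel classmethod, trainRegressor. As a minimal sketch (not from the original post), assuming an existing SparkContext sc and an illustrative toy dataset; "variance" is the impurity used for regression trees:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Toy regression data: the label grows with the single feature (illustrative only).
toy = sc.parallelize([
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(1.0, [1.0]),
    LabeledPoint(4.0, [2.0]),
    LabeledPoint(9.0, [3.0]),
])
reg_model = RandomForest.trainRegressor(toy, categoricalFeaturesInfo={},
                                        numTrees=3, seed=42)
reg_model.predict([2.5])  # averages the individual trees' predictions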

supportedFeatureSubsetStrategies = ('auto', 'all', 'sqrt', 'log2', 'onethird')

classmethod trainClassifier(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None)


Train a random forest model for binary or multiclass classification.

Parameters:

  • data – Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
  • numClasses – Number of classes for classification.
  • categoricalFeaturesInfo – Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1} (see the sketch after the Returns block).
  • numTrees – Number of trees in the random forest.
  • featureSubsetStrategy – Number of features to consider for splits at each node. Supported values: “auto”, “all”, “sqrt”, “log2”, “onethird”. If “auto” is set, this parameter is chosen based on numTrees: if numTrees == 1, it is set to “all”; if numTrees > 1 (a forest), it is set to “sqrt”. (default: “auto”)
  • impurity – Criterion used for information gain calculation. Supported values: “gini” or “entropy”. (default: “gini”)
  • maxDepth – Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 4)
  • maxBins – Maximum number of bins used for splitting features. (default: 32)
  • seed – Random seed for bootstrapping and choosing feature subsets. Set as None to generate seed based on system time. (default: None)

Returns:

RandomForestModel that can be used for prediction.
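To make categoricalFeaturesInfo and the other parameters concrete before the official doctest below, here is a minimal hedged sketch. trainingData stands for a hypothetical existing RDD of LabeledPoint whose feature 0 takes the categorical values 0, 1 and 2; every name outside the API itself is illustrative:

from pyspark.mllib.tree import RandomForest

model = RandomForest.trainClassifier(
    trainingData,                    # hypothetical RDD[LabeledPoint]
    numClasses=2,
    categoricalFeaturesInfo={0: 3},  # feature 0 is categorical with arity 3
    numTrees=10,
    featureSubsetStrategy='sqrt',    # try sqrt(numFeatures) at each split
    impurity='entropy',
    maxDepth=5,
    maxBins=32,                      # must be >= the largest categorical arity
    seed=42,
)

Note that maxBins must be at least as large as the largest categorical arity, which holds here (32 >= 3).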

Example usage:

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import RandomForest
>>>
>>> data = [
...     LabeledPoint(0.0, [0.0]),
...     LabeledPoint(0.0, [1.0]),
...     LabeledPoint(1.0, [2.0]),
...     LabeledPoint(1.0, [3.0])
... ]
>>> model = RandomForest.trainClassifier(sc.parallelize(data), 2, {}, 3, seed=42)
>>> model.numTrees()
3
>>> model.totalNumNodes()
7
>>> print(model)
TreeEnsembleModel classifier with 3 trees

>>> print(model.toDebugString())
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    Predict: 1.0
  Tree 1:
    If (feature 0 <= 1.0)
     Predict: 0.0
    Else (feature 0 > 1.0)
     Predict: 1.0
  Tree 2:
    If (feature 0 <= 1.0)
     Predict: 0.0
    Else (feature 0 > 1.0)
     Predict: 1.0

>>> model.predict([2.0])
1.0
>>> model.predict([0.0])
0.0
>>> rdd = sc.parallelize([[3.0], [1.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]

New in version 1.2.0.
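Beyond the doctest above, a common end-to-end workflow is to hold out a test set, measure the classification error, and persist the fitted ensemble. A minimal sketch, assuming an existing SparkContext sc; the LibSVM sample file ships with Spark distributions, and both paths are illustrative:

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Load a LibSVM file into an RDD of LabeledPoint and split it 70/30.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
trainingData, testData = data.randomSplit([0.7, 0.3], seed=42)

# An empty categoricalFeaturesInfo treats every feature as continuous.
model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=3, seed=42)

# Predict on the features alone, then zip predictions back with true labels.
predictions = model.predict(testData.map(lambda lp: lp.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = (labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count()
           / float(testData.count()))
print('Test Error = ' + str(testErr))

# Save and reload the fitted ensemble (the target path is illustrative).
model.save(sc, 'target/tmp/myRandomForestClassificationModel')
sameModel = RandomForestModel.load(sc, 'target/tmp/myRandomForestClassificationModel')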

 

Source: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree

From: https://blog.51cto.com/u_11908275/6393834
