1. Data import

Import the required classes:

import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.Row
import org.apache.spark.ml.linalg.{Vector,Vectors}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.{Pipeline,PipelineModel}
import org.apache.spark.ml.feature.{IndexToString,StringIndexer,VectorIndexer,HashingTF,Tokenizer}
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary,LogisticRegression,LogisticRegressionModel}
import org.apache.spark.sql.functions
import org.apache.spark.ml.tuning.{CrossValidator,CrossValidatorModel,ParamGridBuilder}

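Code to load the training data:

import spark.implicits._
// One record: six numeric features plus the income label.
case class Adult(features: org.apache.spark.ml.linalg.Vector, label: String)
// Columns used: 0 age, 2 fnlwgt, 4 education-num, 10 capital-gain,
// 11 capital-loss, 12 hours-per-week; column 14 is the income label.
val df = sc.textFile("file:///data/adult.data").map(_.split(",")).map(p => Adult(
  Vectors.dense(p(0).toDouble, p(2).toDouble, p(4).toDouble, p(10).toDouble, p(11).toDouble, p(12).toDouble),
  p(14))).toDF()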
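The loading code indexes into every split line, so a blank line or a non-data header line would throw an exception. The sketch below is a more defensive parser written in plain Scala. The helper name parseAdultLine is hypothetical and not part of the lab code. It assumes the raw UCI layout with 15 comma-separated fields, and it assumes that the test file may carry a header line and labels ending in a period; if your files were already cleaned, those branches simply never fire.

```scala
// Hypothetical helper for illustration; not part of the original lab code.
// Assumed column layout (UCI Adult): 0 age, 2 fnlwgt, 4 education-num,
// 10 capital-gain, 11 capital-loss, 12 hours-per-week, 14 income label.
val numericColumns = Seq(0, 2, 4, 10, 11, 12)

def parseAdultLine(line: String): Option[(Array[Double], String)] = {
  val fields = line.split(",").map(_.trim)
  if (fields.length != 15) None // skips blank lines and any non-data header line
  else scala.util.Try {
    val features = numericColumns.map(i => fields(i).toDouble).toArray
    // Assumption: test labels may end with "."; strip it so both files share labels.
    val label = fields(14).stripSuffix(".")
    (features, label)
  }.toOption
}

// A line in the shape of adult.data, used only to exercise the helper.
val sample = "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
val parsed = parseAdultLine(sample)
val parsedWithDot = parseAdultLine(sample + ".")
val skipped = parseAdultLine("|1x3 Cross validator")
```

In the RDD code, such a helper could be applied with flatMap(line => parseAdultLine(line)) before building Adult records, so that rejected lines are dropped instead of failing the job.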

2. Reading the data set

Code to read the test set and reduce both sets to three principal components:

val test = sc.textFile("file:///data/adult.test").map(_.split(",")).map(p => Adult(
  Vectors.dense(p(0).toDouble, p(2).toDouble, p(4).toDouble, p(10).toDouble, p(11).toDouble, p(12).toDouble),
  p(14))).toDF()
// Fit PCA on the training set only, then apply the same projection to both sets.
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3).fit(df)
val result = pca.transform(df)
val testdata = pca.transform(test)
result.show(false)
testdata.show(false)
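One way to judge whether three components are enough is to inspect the fitted model's explainedVariance, which gives the proportion of variance captured by each component. The sketch below uses made-up rows, not the Adult data. The names session, toyRows and toyPca are illustrative. Spark's PCA centers the columns but does not rescale them, so a column on a much larger scale (like fnlwgt here) tends to dominate the first component.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; elsewhere it starts a local one.
val session = SparkSession.builder().master("local[*]").appName("pca-sketch").getOrCreate()

// Made-up rows: the second column is on a far larger scale than the others.
val toyRows = Seq(
  Vectors.dense(39.0, 77516.0, 13.0, 40.0),
  Vectors.dense(50.0, 83311.0, 13.0, 13.0),
  Vectors.dense(38.0, 215646.0, 9.0, 40.0),
  Vectors.dense(53.0, 234721.0, 7.0, 40.0),
  Vectors.dense(28.0, 338409.0, 13.0, 40.0)
)
val toyDf = session.createDataFrame(toyRows.map(Tuple1.apply)).toDF("features")

val toyPca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(2).fit(toyDf)

// Proportion of variance captured by each of the k components.
println(toyPca.explainedVariance)
toyPca.transform(toyDf).select("pcaFeatures").show(false)
```

If the first entry is close to 1.0, the projection is mostly reproducing the largest-scale column, and standardizing the features before PCA may be worth considering.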

3. Training a classification model to predict resident income

Predicting resident income:

// Index the string labels and the PCA features.
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(result)
labelIndexer.labels.foreach(println)
val featureIndexer = new VectorIndexer().setInputCol("pcaFeatures").setOutputCol("indexedFeatures").fit(result)
println(featureIndexer.numFeatures)
// Map predicted indices back to the original label strings.
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
val lr = new LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(100)
val lrPipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, lr, labelConverter))
val lrPipelineModel = lrPipeline.fit(result)
// Stage 2 of the fitted pipeline is the logistic regression model.
val lrModel = lrPipelineModel.stages(2).asInstanceOf[LogisticRegressionModel]
println("Coefficients: " + lrModel.coefficientMatrix + "\nIntercept: " + lrModel.interceptVector + "\nnumClasses: " + lrModel.numClasses + "\nnumFeatures: " + lrModel.numFeatures)
val lrPredictions = lrPipelineModel.transform(testdata)
// The evaluator defaults to the F1 score, so request accuracy explicitly before computing the error rate.
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
val lrAccuracy = evaluator.evaluate(lrPredictions)
println("Test Error = " + (1.0 - lrAccuracy))
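MulticlassClassificationEvaluator reports the F1 score unless a metric name is set, so subtracting its result from 1.0 only gives an error rate when the metric is accuracy. A minimal sketch of the difference, using made-up (label, prediction) pairs; the names session, scored and toyEvaluator are illustrative:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; elsewhere it starts a local one.
val session = SparkSession.builder().master("local[*]").appName("metric-sketch").getOrCreate()

// Made-up scores: 4 of the 5 predictions match the label.
val scored = session.createDataFrame(Seq(
  (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 1.0)
)).toDF("indexedLabel", "prediction")

val toyEvaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")

// Read the metric name before changing it: this is the evaluator's default.
val defaultMetric = toyEvaluator.getMetricName
val f1 = toyEvaluator.evaluate(scored)

// setMetricName mutates the evaluator and returns it.
val accuracy = toyEvaluator.setMetricName("accuracy").evaluate(scored)

println(s"default metric = $defaultMetric, f1 = $f1, accuracy = $accuracy")
```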

4. Hyperparameter tuning

Code:

// PCA stays unfitted here so that its k can be tuned as part of the pipeline.
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures")
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(df)
val featureIndexer = new VectorIndexer().setInputCol("pcaFeatures").setOutputCol("indexedFeatures")
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
val lr = new LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(100)
val lrPipeline = new Pipeline().setStages(Array(pca, labelIndexer, featureIndexer, lr, labelConverter))
// Grid of 6 x 2 x 3 = 36 parameter combinations.
val paramGrid = new ParamGridBuilder().addGrid(pca.k, Array(1,2,3,4,5,6)).addGrid(lr.elasticNetParam, Array(0.2,0.8)).addGrid(lr.regParam, Array(0.01, 0.1, 0.5)).build()
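The code above stops once the parameter grid is built. A grid like this is typically handed to a CrossValidator, which fits the pipeline for every combination and keeps the best one. The sketch below shows that step on synthetic data with a smaller grid. It is an assumption about how the tuning could continue, not the author's code, and all toy-prefixed names, the fold count and the accuracy metric are illustrative choices.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{PCA, PCAModel, StringIndexer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; elsewhere it starts a local one.
val session = SparkSession.builder().master("local[*]").appName("cv-sketch").getOrCreate()

// Synthetic data (made up): the label depends on the first two of four features.
val rnd = new scala.util.Random(42)
val toyRows = Seq.fill(200) {
  val x = Array.fill(4)(rnd.nextGaussian())
  (Vectors.dense(x), if (x(0) + x(1) > 0) ">50K" else "<=50K")
}
val toyDf = session.createDataFrame(toyRows).toDF("features", "label")

// Unfitted stages, so every fold refits PCA, the indexer and the classifier.
val toyPca = new PCA().setInputCol("features").setOutputCol("pcaFeatures")
val toyIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
val toyLr = new LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("pcaFeatures").setMaxIter(50)
val toyPipeline = new Pipeline().setStages(Array(toyPca, toyIndexer, toyLr))

// 2 x 2 = 4 combinations; k must not exceed the number of input features.
val toyGrid = new ParamGridBuilder().addGrid(toyPca.k, Array(2, 3)).addGrid(toyLr.regParam, Array(0.01, 0.1)).build()

val toyEvaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
val cv = new CrossValidator().setEstimator(toyPipeline).setEvaluator(toyEvaluator).setEstimatorParamMaps(toyGrid).setNumFolds(3)

// Fits the pipeline once per fold and combination, then refits the best on all data.
val cvModel = cv.fit(toyDf)
val best = cvModel.bestModel.asInstanceOf[PipelineModel]
val bestK = best.stages(0).asInstanceOf[PCAModel].getK
val bestRegParam = best.stages(2).asInstanceOf[LogisticRegressionModel].getRegParam

println(s"best k = $bestK, best regParam = $bestRegParam")
println("average metric per combination: " + cvModel.avgMetrics.mkString(", "))
```

With the lab's own pipeline, the same pattern would use lrPipeline and paramGrid as the estimator and parameter maps and fit on df; since the logistic regression is the fourth stage there, it would be read from stages(3).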

From: https://www.cnblogs.com/liuzijin/p/17962484


Experiment 7: Programming practice with MLlib, Spark's machine learning library
