
Spark Integration with Hive


Integrating Hive from the Command Line

Copy Hive's hive-site.xml configuration file into Spark's conf directory; only the following entries are needed:

<configuration>
  <property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/user/hive/warehouse</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://ip:port/hive?serverTimezone=Asia/Shanghai</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>root</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>xxx</value>
  </property>
</configuration>

Then copy the MySQL driver jar from Hive's lib directory into Spark's jars directory.
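A minimal sketch of both copy steps, assuming HIVE_HOME and SPARK_HOME point at the two installations and that the driver jar name matches the 8.0.29 version from the Maven section below (use whatever jar your Hive's lib actually contains):

# Copy the Hive site config into Spark's conf directory
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
# Copy the MySQL JDBC driver into Spark's jars directory
cp $HIVE_HOME/lib/mysql-connector-java-8.0.29.jar $SPARK_HOME/jars/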

Now start the Spark SQL command line:

bin/spark-sql

You can now work with Spark SQL just as you would with Hive.

insert into tb_spark(name,age) values('lisi',23); -- Hive syntax
insert into tb_spark values('lisi',23); -- Spark SQL syntax

Unlike Hive, Spark SQL at this version (the post uses 2.4.3, per the dependency below) does not accept an explicit column list in INSERT INTO; support arrived in later Spark releases. Here the values must simply be listed in the table's column order.
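If you want the intended columns to be visible in the statement anyway, one sketch that parses on 2.4-era Spark is INSERT ... SELECT with aliases. Note the aliases are documentation only: the mapping is still positional, so keep the select list in the table's column order.

insert into tb_spark select 'lisi' as name, 23 as age;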

Integrating Hive in Code

Add the following Maven dependencies (the _2.11 suffix on spark-hive must match your project's Scala version):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.29</version>
</dependency>
With the dependencies in place, a minimal read program looks like this:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
  * Read a Hive table through Spark SQL.
  */
object SparkSQLReadHive {

  def main(args: Array[String]): Unit = {
    // Run locally; outside of local testing the master would come from spark-submit.
    val conf = new SparkConf()
      .setMaster("local")

    // enableHiveSupport() connects the session to the Hive metastore from
    // hive-site.xml; spark.sql.warehouse.dir should match the warehouse
    // location configured for Hive.
    val sparkSession = SparkSession.builder()
      .appName("SparkSQLReadHive")
      .config(conf)
      .config("spark.sql.warehouse.dir", "hdfs://bigdata01:9000/user/hive/warehouse")
      .enableHiveSupport()
      .getOrCreate()

    // Query an existing Hive table and print the first rows.
    sparkSession.sql("select * from student").show()

    sparkSession.stop()
  }
}
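For the write path, a minimal sketch using the same session (tb_spark is the table from the command-line section; the values are illustrative):

import sparkSession.implicits._

// Build a two-column DataFrame and append it to the Hive-managed table.
val df = Seq(("lisi", 23)).toDF("name", "age")
df.write.mode("append").saveAsTable("tb_spark")

// Or go through SQL; as noted above, list values in the table's
// column order rather than naming columns.
sparkSession.sql("insert into tb_spark select 'wangwu' as name, 30 as age")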

Running the read program above on Windows fails with:

Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: (null) entry in command string: null ls -F C:\tmp\hive
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
	at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
	at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:587)
	at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:562)
	at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
	at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
	at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)

Solution

  1. Download Hadoop and extract it locally.
  2. Download winutils.exe and place it in Hadoop's bin directory.
  3. Set the HADOOP_HOME environment variable, or set it in code:
    System.setProperty("hadoop.home.dir","C:\\D-myfiles\\software\\hadoop-3.2.0\\hadoop-3.2.0")
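If you take the code route, the property must be set before the SparkSession is created: Hadoop's Shell utility reads hadoop.home.dir once, when the class is first loaded. A minimal sketch of the ordering, reusing the program above:

def main(args: Array[String]): Unit = {
  // Set first; setting it after the session is built has no effect
  // because Hadoop has already cached the lookup.
  System.setProperty("hadoop.home.dir", "C:\\D-myfiles\\software\\hadoop-3.2.0\\hadoop-3.2.0")

  val sparkSession = SparkSession.builder()
    .appName("SparkSQLReadHive")
    .master("local")
    .enableHiveSupport()
    .getOrCreate()
  // ... queries as above ...
}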
    

After that, another error appears:

Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-;
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)

According to solutions found online, you need to run:

winutils.exe chmod 777 C:\tmp\hive

but this command itself fails with:

The code execution cannot proceed because MSVCR100.dll was not found

Too much hassle; I'm leaving it alone for now. (MSVCR100.dll is part of the Microsoft Visual C++ 2010 Redistributable, so installing that package usually clears this particular error.)

References

Fixing "The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ------" on Windows
A summary of issues when connecting to Hive from local Spark


    3.3Spark在预测核心层的应用我们使用SparkSQL和SparkRDD相结合的方式来编写程序,对于一般的数据处理,我们使用Spark的方式与其他无异,但是对于模型训练、预测这些需要调用算法接口的逻辑就需要考虑一下并行化的问题了。我们平均一个训练任务在一天处理的数据量大约在500G左右,虽然数......