
local spark config

Posted: 2022-10-29 22:13:03
Tags: tmp name config hive pip spark property local

Spark local hive metadata store


By default, Spark uses an embedded Derby database to store metadata. If nothing is configured, it creates the metastore_db folder in your current working directory.

If you want to share the metadata across different working folders, set the following properties in the ${SPARK_HOME}/conf/hive-site.xml file.

 

<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>c:\tmp\hive\scratch</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=c:\tmp\hive\metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>c:\tmp\hive\warehouse</value>
  </property>
  <property>
    <name>spark.sql.warehouse.dir</name>
    <value>c:\tmp\spark\warehouse</value>
  </property>
</configuration>
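The configuration above can also be generated from a single base folder, which keeps the five paths consistent. The following is a minimal sketch, not part of the original setup: the function name build_hive_site is hypothetical, and the property names and the c:\tmp layout are copied from the XML above.

```python
import xml.etree.ElementTree as ET
from pathlib import PureWindowsPath


def build_hive_site(base_dir):
    """Return hive-site.xml text whose paths all live under base_dir.

    Hypothetical helper: it mirrors the properties listed above and
    only changes the base folder.
    """
    base = PureWindowsPath(base_dir)
    metastore_db = str(base / "hive" / "metastore_db")
    props = {
        "hive.exec.scratchdir": str(base / "hive" / "scratch"),
        "javax.jdo.option.ConnectionURL":
            "jdbc:derby:;databaseName=" + metastore_db + ";create=true",
        "javax.jdo.option.ConnectionDriverName":
            "org.apache.derby.jdbc.EmbeddedDriver",
        "hive.metastore.warehouse.dir": str(base / "hive" / "warehouse"),
        "spark.sql.warehouse.dir": str(base / "spark" / "warehouse"),
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

Calling build_hive_site("c:\\tmp") returns the XML as a string, which can then be written to ${SPARK_HOME}/conf/hive-site.xml.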

VS Code config

{
    "terminal.integrated.profiles.windows": {
        "PowerShell -NoProfile": {
            "source": "PowerShell",
            "args": [
                "-NoProfile"
            ]
        },
        "Git-Bash": {
            "path": "C:\\Program Files\\Git-2.31.1\\bin\\bash.exe",
            "args": []
        }
    },
    // Profile the integrated terminal uses by default
    "terminal.integrated.defaultProfile.windows": "Git-Bash",
    "editor.fontSize": 15,
    "window.zoomLevel": 1
}
 
Local Spark (PySpark) Environment Setup
Created by Alan S K Zhang, last modified by Slevin Song_SP on Sep 09, 2022
Introduction
This guide helps you set up a PySpark environment for debugging Spark SQL or PySpark code on your local machine.

Environment Setup 
Install Anaconda
Raise a ServiceNow request: https://hsbcitid.service-now.com/servicenow?id=sc_cat_item&sys_id=c6ff87b5dbc8f300e37db29f299619a7

Configure pip
Edit pip.ini by running:

mkdir %APPDATA%\pip\
notepad %APPDATA%\pip\pip.ini


Put the following content in pip.ini and save:

[global]
index-url=http://efx-nexus.systems.uk.hsbc:8083/nexus/repository/pypi.proxy/simple
trusted-host=efx-nexus.systems.uk.hsbc
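The same file can be written from Python instead of Notepad. This is a rough sketch under stated assumptions: write_pip_ini is a hypothetical helper, and the index URL and trusted host are passed in by the caller rather than hard-coded.

```python
import configparser
from pathlib import Path


def write_pip_ini(appdata_dir, index_url, trusted_host):
    """Create <appdata_dir>/pip/pip.ini with a [global] section.

    Hypothetical helper: it does what the mkdir + notepad steps above
    do by hand, and returns the path of the written file.
    """
    pip_dir = Path(appdata_dir) / "pip"
    pip_dir.mkdir(parents=True, exist_ok=True)

    # interpolation=None keeps any '%' in a URL from being treated specially.
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {
        "index-url": index_url,
        "trusted-host": trusted_host,
    }

    ini_path = pip_dir / "pip.ini"
    with ini_path.open("w") as handle:
        config.write(handle)
    return ini_path
```

On Windows, appdata_dir would be the value of the APPDATA environment variable, for example os.environ["APPDATA"], with the index URL and host taken from the pip.ini content above.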
Configure Conda Channel
Replace the content of C:\Users\Your_Staff_ID\.condarc with the following (create the file if it does not exist):

channels:
- http://gbl18133.systems.uk.hsbc:8080/conda/anaconda
Install PySpark using Anaconda
Open Anaconda Prompt and create a new environment by running the command below. It pins Python to 3.7.9, because the default Python that ships with Anaconda cannot be used.

conda create -n pyspark_env python=3.7.9


Activate the new environment



activate pyspark_env
Install PySpark with conda:

conda install pyspark
The PySpark version installed by conda may fail to run because of a Py4J version problem. If so, run the following:

pip install --upgrade pyspark==3.2.1
pip install --upgrade py4j==0.10.9.3
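To confirm that the two pinned versions are the ones Python actually imports, a small check like the following can be run inside pyspark_env. It is only a sketch: the helper names are hypothetical, and it relies on pyspark and py4j.version each exposing a __version__ attribute.

```python
import importlib

# Versions pinned by the pip commands above.
EXPECTED = {"pyspark": "3.2.1", "py4j.version": "0.10.9.3"}


def installed_version(module_name):
    """Return the module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def check_versions(expected):
    """Return {module: (wanted, found)} for every module that differs."""
    mismatches = {}
    for module_name, wanted in expected.items():
        found = installed_version(module_name)
        if found != wanted:
            mismatches[module_name] = (wanted, found)
    return mismatches
```

An empty dictionary from check_versions(EXPECTED) means both packages match; otherwise each entry shows the wanted and the found version.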
Download winutils.exe 
hadoop-3.2.0 (1).zip

Download the file and unzip it to a desired path.

Download Spark and Unzip
Download link (external): https://www.apache.org/dyn/closer.lua/spark/spark-3.1.3/spark-3.1.3-bin-hadoop3.2.tgz

So far only Spark 3.1.3 has been verified. Other Spark versions have not run successfully yet and need further investigation.

You might need 7-Zip to extract the file. Make sure you end up with the extracted Spark folder and its files rather than a single .tar file.
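If 7-Zip is not available, Python's standard tarfile module can unpack a .tgz in one pass, which avoids being left with an intermediate .tar file. A minimal sketch, with extract_tgz as a hypothetical helper name:

```python
import tarfile
from pathlib import Path


def extract_tgz(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the sorted names of the top-level entries in dest_dir so the
    caller can confirm that a folder, not a .tar file, was produced.
    Only use this on archives from a source you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as archive:
        archive.extractall(str(dest))
    return sorted(entry.name for entry in dest.iterdir())
```

For the archive above, the call would look like extract_tgz("spark-3.1.3-bin-hadoop3.2.tgz", "c:\\spark"), where the destination folder is only an example.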

Add hive-site.xml 
Create hive-site.xml in <Spark root folder>\conf\ and add the following content:

<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>c:\tmp\hive</value>
    <description>Scratch space for Hive jobs</description>
  </property>
</configuration>
You can replace "c:\tmp\hive" with any path you like. Just make sure you create the folder beforehand.

To make sure you have sufficient access to the path, you can use winutils.exe to grant it:

cd <path where winutils.exe is located>
winutils chmod 777 c:\tmp
winutils chmod 777 c:\tmp\hive
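The folder creation and a final check of the setup can be sketched in Python as follows. This is an illustration, not a verified part of the guide: ensure_dirs and hive_smoke_test are hypothetical names, and the smoke test assumes PySpark is installed and hive-site.xml is in place.

```python
import os


def ensure_dirs(paths):
    """Create each missing folder and return the list of folders created."""
    created = []
    for path in paths:
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created


def hive_smoke_test():
    """Start a local Hive-enabled session and list the databases.

    Returns the warehouse directory the session reports. PySpark is
    imported here so ensure_dirs stays usable without it.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("hive-smoke-test")
        .enableHiveSupport()
        .getOrCreate()
    )
    try:
        spark.sql("SHOW DATABASES").show()
        return spark.conf.get("spark.sql.warehouse.dir")
    finally:
        spark.stop()
```

Run ensure_dirs([r"c:\tmp\hive"]) before the winutils commands, then call hive_smoke_test() once everything is configured. If the session starts and prints the database list, the scratch folder and metastore are usable.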

 

 

 

 

From: https://www.cnblogs.com/wangjiahua/p/16840006.html
