Spark local Hive metadata store
By default, Spark uses an embedded Derby database to store metadata. If we don't configure anything, it creates a metastore_db folder in the current working directory.
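As a rough illustration (the table name and launch folder are only placeholders, not part of the original setup), a local session with Hive support enabled will create metastore_db next to wherever you start it:

from pyspark.sql import SparkSession

# started from e.g. C:\work\projectA, so the Derby metadata ends up in C:\work\projectA\metastore_db
spark = (SparkSession.builder
         .master("local[*]")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS demo (id INT) USING parquet")  # registered in that local metastore
spark.sql("SHOW TABLES").show()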
If you want to share the metadata across two different working directories, you have to set a few properties in the ${SPARK_HOME}/conf/hive-site.xml file.
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>c:\tmp\hive\scratch</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=c:\tmp\hive\metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>c:\tmp\hive\warehouse</value>
  </property>
  <property>
    <name>spark.sql.warehouse.dir</name>
    <value>c:\tmp\spark\warehouse</value>
  </property>
</configuration>
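With these properties in place, the Derby database and warehouse sit at fixed paths instead of under the working directory, so sessions started from different folders share the same catalog. A minimal sketch, assuming the hive-site.xml above is in ${SPARK_HOME}/conf and using an example table name:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .enableHiveSupport()   # picks up ${SPARK_HOME}/conf/hive-site.xml
         .getOrCreate())

# run from folder A: persist a table into the shared warehouse
spark.range(5).write.mode("overwrite").saveAsTable("shared_demo")

# run later from a session started in folder B: the table is still visible
spark.sql("SHOW TABLES").show()
spark.table("shared_demo").show()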
VS Code config
"terminal.integrated.profiles.windows": { "PowerShell -NoProfile": { "source": "PowerShell", "args": [ "-NoProfile" ] }, "Git-Bash": { "path": "C:\\Program Files\\Git-2.31.1\\bin\\bash.exe", "args": [] } }, "terminal.integrated.defaultProfile.windows": "Git-Bash", "editor.fontSize": 15, "window.zoomLevel": 1, //The terminal enables the configuration by default }
Local Spark (PySpark) Environment Setup

Introduction

This guide will help you set up a PySpark environment for debugging SparkSQL or PySpark code on your local machine.

Environment Setup

Install Anaconda

Raise a ServiceNow request: https://hsbcitid.service-now.com/servicenow?id=sc_cat_item&sys_id=c6ff87b5dbc8f300e37db29f299619a7

Configure pip

Edit pip.ini by running:

mkdir %APPDATA%\pip\
notepad %APPDATA%\pip\pip.ini

Write the following into pip.ini and save:

[global]
index-url=http://efx-nexus.systems.uk.hsbc:8083/nexus/repository/pypi.proxy/simple
trusted-host=efx-nexus.systems.uk.hsbc

Configure the Conda channel

Replace the content of C:\Users\Your_Staff_ID\.condarc with the following (create the file if it does not exist):

channels:
  - http://gbl18133.systems.uk.hsbc:8080/conda/anaconda

Install PySpark using Anaconda

Open the Anaconda prompt and create a new environment with the command below. It pins Python 3.7.9; the default Python provided by Anaconda cannot be used.

conda create -n pyspark_env python=3.7.9

Activate the new environment:

activate pyspark_env

Install PySpark with conda:

conda install pyspark

The installed PySpark may not be able to run due to a Py4j problem; if so, run the following:

pip install --upgrade pyspark==3.2.1
pip install --upgrade py4j==0.10.9.3

Download winutils.exe

hadoop-3.2.0 (1).zip

Download the file and unzip it to a path of your choice.

Download Spark and unzip

Download link (external): https://www.apache.org/dyn/closer.lua/spark/spark-3.1.3/spark-3.1.3-bin-hadoop3.2.tgz

So far only 3.1.3 has been verified; other Spark versions have not been made to run yet and may need extra study to figure out. You might need 7-Zip to unzip the file. Make sure you extract it all the way down to a folder of files rather than stopping at a single tar file.

Add hive-site.xml

Create hive-site.xml in <Spark root folder>\conf\ and add the following content:

<property>
<name>hive.exec.scratchdir</name>
<value>c:\tmp\hive</value>
<description>Scratch space for Hive jobs</description>
</property>

You can replace "c:\tmp\hive" with any path you like; just make sure you create the folder beforehand. To make sure you have sufficient access to the path, you can run winutils.exe to grant it:

cd <path that winutils.exe located>
winutils chmod 777 c:\tmp
winutils chmod 777 c:\tmp\hive
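Once the steps above are done, a quick way to check the environment is a small PySpark smoke test. This is only a sketch: the HADOOP_HOME value below is an assumed unzip location for the winutils package, so adjust it to wherever you extracted it.

import os
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop-3.2.0")  # assumption: folder containing bin\winutils.exe
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["HADOOP_HOME"], "bin")

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-smoke-test")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("t")
spark.sql("SELECT count(*) AS n FROM t").show()
spark.stop()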