First, prepare the macOS environment.
Installing Java, Scala, and Python is skipped here; we start from Hadoop and Spark.
Install Hadoop
The simplest way to install Hadoop is with brew:
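As with Spark further down, brew does the work. Note that brew may pull a newer release than the 2.7.3 assumed in the paths below, so adjust those paths to match what actually gets installed:

```
brew install hadoop
```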
Locate the installation directory
Once the install finishes, the Hadoop configuration files live here:
```
cd /usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop
```
Edit core-site.xml
```
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/Cellar/hadoop/2.7.3/libexec/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
```
Edit hdfs-site.xml
```
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/Cellar/hadoop/2.7.3/libexec/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/Cellar/hadoop/2.7.3/libexec/tmp/dfs/data</value>
  </property>
</configuration>
```
Add the environment variables:
```
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.3/libexec
export PATH=$PATH:${HADOOP_HOME}/bin
```
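These exports only last for the current terminal session. To make them permanent, append them to the shell profile; a minimal sketch assuming the default bash shell (the Spark variables added later belong here too):

```
# Persist the Hadoop variables in the login-shell profile (bash assumed)
cat >> ~/.bash_profile <<'EOF'
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.3/libexec
export PATH=$PATH:${HADOOP_HOME}/bin
EOF
source ~/.bash_profile   # reload in the current session
```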
Format HDFS
```
cd /usr/local/Cellar/hadoop/2.7.3/bin
./hdfs namenode -format
```
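From the same bin directory, `hdfs getconf` is a quick way to confirm the XML edits above are actually in effect:

```
# Should print hdfs://localhost:8020 from core-site.xml
./hdfs getconf -confKey fs.defaultFS
# Should print 1, the single-node replication factor from hdfs-site.xml
./hdfs getconf -confKey dfs.replication
```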
Start Hadoop
```
cd /usr/local/Cellar/hadoop/2.7.3/sbin
./start-all.sh
```
Type `jps` in the terminal to check the Java processes:
```
1206 DataNode
1114 NameNode
1323 SecondaryNameNode
```
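With the daemons up, a short smoke test verifies that HDFS accepts reads and writes (the home directory name is whatever `whoami` returns):

```
# Create a home directory in HDFS, then list the root to confirm it exists
hdfs dfs -mkdir -p /user/$(whoami)
hdfs dfs -ls /
```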
Install Spark
Spark is likewise installed with brew:
```
brew install apache-spark
```
Locate the installation directory
The Spark configuration files live here:
```
cd /usr/local/Cellar/apache-spark/2.1.0/libexec/conf
```
Edit spark-env.sh
```
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
# add the following two lines to spark-env.sh:
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
```
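Before wiring anything else up, the SparkPi example bundled with the distribution is a quick sanity check that Spark itself runs (the Pi estimate it prints will vary from run to run):

```
cd /usr/local/Cellar/apache-spark/2.1.0/libexec
# Runs the bundled example locally; look for "Pi is roughly ..." in the output
./bin/run-example SparkPi 10
```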
Add the environment variables:
```
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec
export PATH=$PATH:${SPARK_HOME}/bin
```
Start Spark
```
cd /usr/local/Cellar/apache-spark/2.1.0/libexec/sbin
./start-all.sh
```
Check the processes:
```
jps
6052 Worker
6022 Master
6728 Jps
5546 NameNode
5739 SecondaryNameNode
5947 NodeManager
5630 DataNode
5855 ResourceManager
```
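The standalone Master also serves a web UI on port 8080, which should show the single Worker registered:

```
# Open the Spark standalone master UI in the default browser (macOS)
open http://localhost:8080
```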
Configure PyCharm for Spark development
Open PyCharm (my Python version is 2.7).
Create a new xxxx, then a new file: a simple word-count-style example that counts the lines containing 'a' and 'b':
```
from pyspark import SparkContext

logFile = "/Users/admin/Desktop/BackUp"
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
```
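Once ${SPARK_HOME}/bin is on PATH, the same script also runs outside PyCharm via spark-submit (simple_app.py is just a placeholder for whatever filename you saved):

```
# Submit the script to Spark; the master is the "local" set inside the script
spark-submit simple_app.py
```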
Press F4 to open the Run configuration for the current script, and fill in the Environment Variables field:
```
PYTHONPATH /usr/local/Cellar/apache-spark/2.1.0/libexec/python
```
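If PyCharm still cannot import pyspark, the py4j archive bundled under python/lib usually has to be appended to PYTHONPATH as well; the exact zip name depends on the Spark build, so check what is there:

```
# The py4j-*-src.zip listed here also belongs on PYTHONPATH
ls /usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib
```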
With that, the environment setup is complete.