初识Spark

最近在看《Spark for python developer》, 在安装spark和集成notebook时遇到一点小问题，记录一下。
直接安装Spark编译后的最新版本，现在notebook已经被jupyter集成，也需要安装。然后需要设置两个环境变量如下

1 2	export PYSPARK_DRIVER_PYTHON=jupyter export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8888"

这时再重新运行 ./bin/pyspark会看到控制台上notebook server log输出, 浏览器访问localhost:8888，就可以看到notebook页面了

接下来测试一下，统计shakespeare全集中的单词数量：

from operator import add

file_in = sc.textFile('/opt/shakespeare.txt')
# count lines
print('number of lines in file: %s' % file_in.count())
# add up lengths of each line
chars = file_in.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)
# Get words from the input file
words =file_in.flatMap(lambda line: re.split('\W+', line.lower().strip()))
# words of more than 3 characters
words = words.filter(lambda x: len(x) > 3)
# set count 1 per word
words = words.map(lambda w: (w,1))
# reduce phase - sum count all the words
words = words.reduceByKey(add)

Happy Bigdata!