Configuring Dependencies in IntelliJ IDEA ---- Hadoop as a Plain Java Project

1. Create a plain Java project.
2. Add the dependencies:
   1. File – Project Structure... – Modules – Dependencies – + – Library... – Java
   2. Select every folder under /usr/local/Cellar/hadoop/2.5.2/libexec/share/hadoop except httpfs (on Ubuntu: /usr/local/hadoop/share/hadoop/).
   3. The library name can be anything, e.g. "common"; click OK.
   4. + – Jars or directories...
   5. Select /usr/local/Cellar/hadoop/2.5.2/libexec/share/hadoop/common/lib (on Ubuntu: /usr/local/hadoop/share/hadoop/common/lib). Dependencies should now contain two new entries, "common" and "lib".

3. In Project Structure, edit Artifacts and add a configuration for generating the jar.

Building the HelloHadoop jar
Generating the jar is also fairly simple:
1. Select File -> Project Structure to open the Project Structure dialog.
2. Select Artifacts on the left and click the "+" button at the top.
3. In the popup, choose JAR -> From modules with dependencies...
4. Select the main class to launch, then confirm.
5. Click Apply; the dialog closes.

In IDEA, select Build -> Build Artifacts and choose Build or Rebuild; the generated jar file is placed under out/artifacts in the project directory.

(This should be enough on Linux, but it does not work on Mac OS X, because the OS X filesystem is case-insensitive by default and the artifact build fails. One workaround is to switch to Maven to build the project.) After that, you can write the code, compile it, and package it.
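If you go the Maven route, a minimal sketch of the dependency section might look like the following. This is an assumption on my part rather than the original author's setup: it assumes Hadoop 2.5.2 (to match the installation path above) and uses the hadoop-client aggregator artifact, which pulls in the common, HDFS, and MapReduce client libraries.

    <dependencies>
        <!-- Hadoop client aggregator: brings in hadoop-common, HDFS and MapReduce client jars -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.5.2</version>
        </dependency>
    </dependencies>

With this in place, mvn package produces the jar under target/ rather than out/artifacts.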

Writing the WordCount MapReduce program
The official example code is used here as-is:
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class); // Note: this line is required, otherwise Hadoop cannot find the corresponding class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Note that the line job.setJarByClass(WordCount.class); must be added to the official example code; without it Hadoop cannot find the corresponding classes when the job runs.

Running the HelloHadoop jar
Copy the generated HelloHadoop.jar to the NameNode of the Hadoop cluster.

For testing purposes, a simple test data file wctest.txt was written:

this is hadoop test string
hadoop hadoop
test test
string string string
Upload the test file to HDFS:

[hdfs@172-22-195-15 data]$ hdfs dfs -mkdir /user/chenbiaolong/wc_test_input
[hdfs@172-22-195-15 data]$ hdfs dfs -put wctest.txt /user/chenbiaolong/wc_test_input

Then cd to the directory containing the jar and run the HelloHadoop jar:
[hdfs@172-22-195-15 code]$ cd WorkCount/
[hdfs@172-22-195-15 WorkCount]$ ls
HelloHadoop.jar
[hdfs@172-22-195-15 WorkCount]$ hadoop jar HelloHadoop.jar WordCount /user/chenbiaolong/wc_test_input /user/chenbiaolong/wc_test_output
15/03/26 15:54:19 INFO impl.TimelineClientImpl: Timeline service address: :8188/ws/v1/timeline/
15/03/26 15:54:19 INFO client.RMProxy: Connecting to ResourceManager at /172.22.195.17:8050
15/03/26 15:54:20 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/03/26 15:54:20 INFO input.FileInputFormat: Total input paths to process : 1
15/03/26 15:54:21 INFO mapreduce.JobSubmitter: number of splits:1
15/03/26 15:54:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1427255014010_0005
15/03/26 15:54:21 INFO impl.YarnClientImpl: Submitted application application_1427255014010_0005
15/03/26 15:54:21 INFO mapreduce.Job: The url to track the job: http://172-22-195-:8088/proxy/application_1427255014010_0005/
15/03/26 15:54:21 INFO mapreduce.Job: Running job: job_1427255014010_0005
15/03/26 15:54:28 INFO mapreduce.Job: Job job_1427255014010_0005 running in uber mode : false
15/03/26 15:54:28 INFO mapreduce.Job:  map 0% reduce 0%
15/03/26 15:54:34 INFO mapreduce.Job:  map 100% reduce 0%
15/03/26 15:54:41 INFO mapreduce.Job:  map 100% reduce 100%
15/03/26 15:54:42 INFO mapreduce.Job: Job job_1427255014010_0005 completed successfully
15/03/26 15:54:43 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=150
        FILE: Number of bytes written=225815
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=210
        HDFS: Number of bytes written=37
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Rack-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4133
        Total time spent by all reduces in occupied slots (ms)=4793
        Total time spent by all map tasks (ms)=4133
        Total time spent by all reduce tasks (ms)=4793
        Total vcore-seconds taken by all map tasks=4133
        Total vcore-seconds taken by all reduce tasks=4793
        Total megabyte-seconds taken by all map tasks=16928768
        Total megabyte-seconds taken by all reduce tasks=19632128
    Map-Reduce Framework
        Map input records=4
        Map output records=12
        Map output bytes=120
        Map output materialized bytes=150
        Input split bytes=137
        Combine input records=0
        Combine output records=0
        Reduce input groups=5
        Reduce shuffle bytes=150
        Reduce input records=12
        Reduce output records=5
        Spilled Records=24
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=91
        CPU time spent (ms)=3040
        Physical memory (bytes) snapshot=1466998784
        Virtual memory (bytes) snapshot=8678326272
        Total committed heap usage (bytes)=2200961024
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=73
    File Output Format Counters
        Bytes Written=37
[hdfs@172-22-195-15 WorkCount]$
The result is written to /user/chenbiaolong/wc_test_output:
[hdfs@172-22-195-15 WorkCount]$ hdfs dfs -ls /user/chenbiaolong/wc_test_output
Found 2 items
-rw-r--r-- 3 hdfs hdfs 0 2015-03-26 15:54 /user/chenbiaolong/wc_test_output/_SUCCESS
-rw-r--r-- 3 hdfs hdfs 37 2015-03-26 15:54 /user/chenbiaolong/wc_test_output/part-r-00000
[hdfs@172-22-195-15 WorkCount]$ hdfs dfs -cat /user/chenbiaolong/wc_test_output/part-r-00000
hadoop 3
is 1
string 4
test 3
this 1
[hdfs@172-22-195-15 WorkCount]$
As we can see, the job produced the correct result.
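One loose end: the run log above contains the warning "Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this." As a minimal, hedged sketch (not part of the original post), the driver could be restructured as follows; the class name WordCountDriver is hypothetical, and the Map and Reduce classes from the WordCount code above are reused unchanged.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCountDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // getConf() returns the Configuration populated by ToolRunner,
            // so generic options such as -D key=value are already applied.
            Job job = new Job(getConf(), "wordcount");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.Map.class);
            job.setReducerClass(WordCount.Reduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner parses the standard Hadoop generic options before
            // handing the remaining arguments to run().
            int exitCode = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
            System.exit(exitCode);
        }
    }

Rebuilding the jar and running it with WordCountDriver as the main class should make the warning disappear, and generic options such as -D mapreduce.job.reduces=2 can then be passed on the command line.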
