Getting streaming data from Twitter using Flume and then loading it into Hive for analysis.
1. We are using the Cloudera QuickStart VM, so it is assumed that Java, Hadoop, Hive, Flume, etc. are already installed and everything is up and running.
2. The steps to create a Twitter app and obtain the consumerKey, consumerSecret, accessToken, and accessTokenSecret are also not covered here; they are described in many blogs.
3. The configuration file used by Flume:
[cloudera@quickstart conf]$ pwd
/usr/lib/flume-ng/conf
[cloudera@quickstart conf]$ cat twitter.conf
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = ***
TwitterAgent.sources.Twitter.consumerSecret = *****
TwitterAgent.sources.Twitter.accessToken = ****
TwitterAgent.sources.Twitter.accessTokenSecret = ****
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000
TwitterAgent.sources.Twitter.keywords = @narendramodi
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart.cloudera/user/cloudera/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
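A common mistake in Flume configs is binding the source and sink to different channel names, which makes the agent start but move no data. The wiring above can be sanity-checked with a short Python sketch; the config text below is a trimmed copy of twitter.conf (in practice you would read the real file), and `parse_props` is a small hypothetical helper, not part of Flume.

```python
# Sanity-check the Flume agent wiring (source -> channel -> sink)
# before starting the agent. CONF is a trimmed copy of twitter.conf.
CONF = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
"""

def parse_props(text):
    """Parse 'key = value' lines into a dict, skipping comments and blanks."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

props = parse_props(CONF)
source_channel = props["TwitterAgent.sources.Twitter.channels"]
sink_channel = props["TwitterAgent.sinks.HDFS.channel"]
# Both must name the same channel, or events never reach the sink.
assert source_channel == sink_channel == "MemChannel"
print("source and sink both bound to", source_channel)
```

If the names differ, the assertion fails and you know the agent would silently drop the pipeline before you waste time watching an empty HDFS directory.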
4. Then run the agent with this config file using the following command:
./flume-ng agent -n TwitterAgent -c conf -f ../conf/twitter.conf
from the location: /usr/lib/apache-flume-1.6.0-bin/bin
5. As per the sink output path in our configuration file, we can see files being generated at:
hdfs://quickstart.cloudera/user/cloudera/twitter_data/*
By default these files are in Avro format, so we use avro-tools-1.7.7.jar to extract the schema, which will then be used to build the Hive table.
Run the command as:
C:\Users\ankit.baldua.AVULZE0\Desktop\flume>java -jar avro-tools-1.7.7.jar getschema FlumeData.1476876054060.avro > FlumeData.avsc
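The generated .avsc file is plain JSON, so the columns Hive will expose can be previewed with nothing but the standard library. A minimal sketch, using a shortened stand-in schema (the real FlumeData.avsc produced by getschema has many more fields):

```python
import json

# A shortened stand-in for the FlumeData.avsc produced by
# "avro-tools getschema"; the real schema has many more fields.
SAMPLE_AVSC = """
{
  "type": "record",
  "name": "Doc",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "retweet_count", "type": "long"}
  ]
}
"""

schema = json.loads(SAMPLE_AVSC)
# These field names become the Hive columns once the table
# points at the schema via avro.schema.url.
columns = [f["name"] for f in schema["fields"]]
print(columns)  # → ['id', 'text', 'retweet_count']
```

Running the same three lines of parsing against the real .avsc is a quick way to confirm the schema extraction worked before creating the Hive table.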
6. Connect to the Hive shell and create the table as below; first put the generated .avsc file at the path referenced by avro.schema.url.
CREATE EXTERNAL TABLE tweetsavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs://quickstart.cloudera/user/cloudera/FlumeData.avsc');
7. To load data into the table created above:
LOAD DATA INPATH '/user/cloudera/twitter_data/*' OVERWRITE INTO TABLE tweetsavro;
hive> describe tweetsavro;
OK
id                      string      from deserializer
user_friends_count      int         from deserializer
user_location           string      from deserializer
user_description        string      from deserializer
user_statuses_count     int         from deserializer
user_followers_count    int         from deserializer
user_name               string      from deserializer
user_screen_name        string      from deserializer
created_at              string      from deserializer
text                    string      from deserializer
retweet_count           bigint      from deserializer
retweeted               boolean     from deserializer
in_reply_to_user_id     bigint      from deserializer
source                  string      from deserializer
in_reply_to_status_id   bigint      from deserializer
media_url_https         string      from deserializer
expanded_url            string      from deserializer
This confirms that the table columns were created automatically from the .avsc file.
8. The table now holds the loaded data, and we can run SELECT queries as usual.
hive> select * from tweetsavro limit 1;
OK
790880562726612992  97  NULL  Kirsten Danielle Tan Delavin Madeleine Yrenea Madayag #17 Hope Elizabeth Soberano  361  27  Sheina Delavin  sheina_bastes  2016-10-25T04:39:54Z  Balita hahahaha cute ni Dundun  0  false  -1  Twitter for Android  -1  NULL  NULL
Time taken: 0.859 seconds, Fetched: 1 row(s)
hive>
9. In this way we can successfully load streaming Twitter data into Hive.
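Once the data is in Hive, the analysis itself is ordinary aggregation. As a sketch of what a query such as SELECT user_screen_name, SUM(retweet_count) FROM tweetsavro GROUP BY user_screen_name would compute, here is the same logic over a few hypothetical rows shaped like the tweetsavro columns (the usernames and counts below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows shaped like the tweetsavro table columns.
tweets = [
    {"user_screen_name": "alice", "retweet_count": 3},
    {"user_screen_name": "bob",   "retweet_count": 1},
    {"user_screen_name": "alice", "retweet_count": 2},
]

# Equivalent of:
#   SELECT user_screen_name, SUM(retweet_count)
#   FROM tweetsavro GROUP BY user_screen_name;
totals = defaultdict(int)
for t in tweets:
    totals[t["user_screen_name"]] += t["retweet_count"]

print(dict(totals))  # → {'alice': 5, 'bob': 1}
```

In practice you would run the HiveQL version directly in the hive shell; the sketch is just to make the shape of the result concrete.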