Project #1: Tweet Search

Automated ways of searching text are now used across the Web, within institutions, and across critical services like the healthcare industry. This project will have you write a basic search tool to filter large amounts of textual data in a fraction of a second. It would take a human days (weeks, months) to do the same thing by hand.

We are providing you with almost 200 thousand actual public tweets from Twitter. Your program will read them and provide a search interface. Dr. Chambers has a continual feed from Twitter that downloads millions of tweets every day, so this is just a fraction of a fraction of the data that we have here in the CS department. We use it to do fundamental research in artificial intelligence and information extraction, such as tracking international relations or DDoS attacks on networks. This short project will let you poke around the data.

Due date: 2359 on 10 Feb 2020

Honor: see course policy. You may not help with or discuss any aspect of this project with anybody (except your instructor and MGSP).


Disclaimer
You may not distribute this project's data to anyone beyond the USNA. Our agreement with Twitter prevents copying and distributing.
This is raw, real world data. The standard disclaimer applies as it does whenever we step out onto the Web. You may come across offensive material. Please behave like mature adults and future officers, as appropriate.


Step 1 (50 pts): Read and Print

Input Data

Input to your project's code will come from a single text file. The file contains one tweet per line. Each line contains a single tweet's 3 data fields (tweet, username, date) separated by tab ('\t') characters. You must read this file, create Tweet objects, and store them in an array. Here is an example of a line in the file:

omg i can't believe you said that!!! #whoknows       stargazerz      2013-03-20
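
As a rough illustration of the field layout, the example line above could be broken apart on its tab characters as follows. The class and method names (LineFields, split) are made up for this example and are not part of the required design.

```java
// Hypothetical helper, shown only to illustrate the tab-separated layout.
class LineFields {
    /** Splits one raw line into its three fields: tweet, username, date. */
    static String[] split(String line) {
        // Assumes the line is well formed, with exactly two tab characters.
        return line.split("\t");
    }

    public static void main(String[] args) {
        String line = "omg i can't believe you said that!!! #whoknows\tstargazerz\t2013-03-20";
        String[] fields = split(line);
        System.out.println("tweet: " + fields[0]);
        System.out.println("user:  " + fields[1]);
        System.out.println("date:  " + fields[2]);
    }
}
```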

Do this now: create a Project1/ directory for your code.
Download the two tweet data files into your new directory: all tweets (alltweets.txt, 15 MB) and some tweets (sometweets.txt, 2 KB)
Peek inside and take a look at the data.

Create a Tweet class.

You must write a class definition for a Tweet that appropriately stores the different pieces of information like the text, username, year, month, and day. Do not store the entire line as one String. You must follow proper Data Hiding techniques.

Make a constructor that uses this exact prototype definition:

public Tweet(String newtext, String newuser, String newdate) {}

Remember that the constructor's sole job is to initialize the variables in the Tweet object. You will likely need to split that String date into three int values. How do we do that in Java? See the String .split() method to break it up into 3 parts. This will result in a String for the year like "2013". To convert this to an int, Java provides a static method that you can call:

Integer.parseInt("2013") ==> 2013

Finally, create a member method in Tweet: String toString(). This method returns a single String that represents all of the Tweet's data. The method does not print anything; it simply builds and returns a String. What should the string look like? Follow this format (use a tab character '\t' to separate the 3 output fields):

tweet text goes first     [usernameInBrackets]    1/30/2013
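
Putting the constructor and toString() together, a minimal sketch of the class might look like the one below. The getter names are assumptions, and the class is declared without the public modifier only so that the snippet compiles in any file; in your project it would live in Tweet.java.

```java
/** One tweet: its text, its author, and its date stored as three ints. */
class Tweet {
    // Private fields plus getters: one way to satisfy the data hiding requirement.
    private final String text;
    private final String user;
    private final int year;
    private final int month;
    private final int day;

    /** Builds a Tweet from the three raw fields; newdate is expected as "year-month-day". */
    public Tweet(String newtext, String newuser, String newdate) {
        text = newtext;
        user = newuser;
        String[] parts = newdate.split("-");    // "2013-03-20" -> {"2013", "03", "20"}
        year = Integer.parseInt(parts[0]);
        month = Integer.parseInt(parts[1]);     // "03" parses to 3
        day = Integer.parseInt(parts[2]);
    }

    // Hypothetical getter names; pick whatever your design needs.
    public String getText() { return text; }
    public String getUser() { return user; }
    public int getYear() { return year; }
    public int getMonth() { return month; }
    public int getDay() { return day; }

    /** Returns text, [user], and month/day/year separated by tabs. Prints nothing. */
    @Override
    public String toString() {
        return text + "\t[" + user + "]\t" + month + "/" + day + "/" + year;
    }
}
```

Because the date parts are stored as ints, the leading zero in an input month such as "03" disappears when the date is printed.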

Create your main program.

Create Search.java with a main() method. Your main method will get the file path from the command line, and then call another method:

Tweet[] readFile(String path) {}

You must write readFile. Where is the proper place to define this method? To return an array, you obviously need to initialize it to a certain size. We will hardcode the size for this step: size 33 (this is how many tweets are in the file sometweets.txt).
Then write code to open the file, read its lines, and fill the Tweet[] array.
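
One possible shape for Step 1's main() and readFile is sketched below. It assumes every line in the file is well formed, uses BufferedReader (your labs may have used a different reader), and places readFile as a static method of Search, which is only one option. The Tweet class here is a compact copy so that the snippet compiles on its own.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Compact Tweet so the sketch is self-contained; fields stay private.
class Tweet {
    private final String text, user;
    private final int year, month, day;

    public Tweet(String newtext, String newuser, String newdate) {
        text = newtext;
        user = newuser;
        String[] p = newdate.split("-");
        year = Integer.parseInt(p[0]);
        month = Integer.parseInt(p[1]);
        day = Integer.parseInt(p[2]);
    }

    @Override
    public String toString() {
        return text + "\t[" + user + "]\t" + month + "/" + day + "/" + year;
    }
}

class Search {
    // Hardcoded for Step 1: the number of tweets in sometweets.txt.
    private static final int SIZE = 33;

    /** Reads at most SIZE lines from path and turns each into a Tweet. */
    static Tweet[] readFile(String path) throws IOException {
        Tweet[] tweets = new Tweet[SIZE];
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            int i = 0;
            while (i < SIZE && (line = in.readLine()) != null) {
                String[] f = line.split("\t");   // tweet, username, date
                tweets[i++] = new Tweet(f[0], f[1], f[2]);
            }
        }
        return tweets;   // slots stay null if the file has fewer than SIZE lines
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java Search <tweets-file>");
            return;
        }
        Tweet[] tweets = readFile(args[0]);
        System.out.println("Array size: " + tweets.length);
        for (Tweet t : tweets) {
            System.out.println(t);   // println calls toString() for us
        }
    }
}
```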

Expected Output

Make sure your running program looks exactly like the following. Note that the date format is different from the input. The year is listed last, and all fields are separated by slashes:

> java Search
usage: java Search <tweets-file>
> java Search sometweets.txt
Array size: 33
i kicked daniels knee  [st0rmcl0aks]  8/11/2013
rt @jvson_: i think it's horrible that people feel embarrassed to take btec now because of how it's mocked on here  [deebaybiie]  8/11/2013
@ally_b237 asehh..your bus has not yet touch down??  [wiz_ked]  8/12/2013
...
...
rt @coralynencs: @alexxmathias mdddddddddddddr  [alexxmathias]  8/12/2013

STOP: save your progress!

At the end of this step, you should have (at a minimum) Search.java and Tweet.java.

Create a subdirectory called step1/ inside your current Project1/ directory. Copy all .java files into it, and nothing else. You will do this at each step.


Step 2 (20 pts): Use a Queue

The previous step reads the file into an array of known size. This is not ideal because we want to read files of arbitrary size. It is also not ideal because arrays are difficult to resize and expensive to duplicate. This step replaces the Tweet array with a Queue instead. The output will be similar to Step 1, but more flexible because a file of any size can now be read.

Reuse Your Queue and Node from the HW.

Create a Queue to store those Tweet objects. You should use your Queue HW's code! The HW's Queue used Nodes, and each Node contained a single String. The only difference here is that the Nodes hold Tweets instead of Strings.

To be clear, you need two new classes for this step:

  1. Queue
  2. Node

As in the HW, you may keep Node as an inner class within Queue. You could alternatively split it out into its own Node.java if you prefer. Your task is then the following:

  1. Modify Node to store a Tweet object.
  2. Create Queue.java with appropriate queue methods like void enqueue(Tweet), Tweet dequeue()
  3. Update Search.java to use the Queue instead of the Tweet[] array.

After you've updated your Queue to work with Tweets, add two more methods that will be useful in your main program:

  1. Create a member method in Queue: void printAll(). This should loop over the entire queue and call Tweet's toString() method above to print the tweets out. Don't alter the queue itself! Just print every tweet in order!
  2. Create a member method in Queue: int length(). This should return how many Nodes are currently in the Queue. It is up to you to decide how best to implement this.
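
The list above could be sketched roughly as follows. This is not your HW solution: it keeps Node as a nested class, tracks the length with a counter (one of several reasonable choices), and assumes dequeue() returns null on an empty queue. The tiny Tweet class is a stand-in that stores only the text.

```java
// Stand-in Tweet so the sketch compiles alone; the real one stores user and date too.
class Tweet {
    private final String text;

    public Tweet(String newtext, String newuser, String newdate) {
        text = newtext;
    }

    @Override
    public String toString() {
        return text;
    }
}

/** A singly linked FIFO queue of Tweets. */
class Queue {
    private static class Node {
        Tweet data;
        Node next;

        Node(Tweet data) {
            this.data = data;
        }
    }

    private Node head;   // next Tweet to leave the queue
    private Node tail;   // most recently added Tweet
    private int count;   // number of Nodes currently linked

    /** Adds a Tweet at the back. */
    public void enqueue(Tweet t) {
        Node n = new Node(t);
        if (tail == null) {
            head = n;
        } else {
            tail.next = n;
        }
        tail = n;
        count++;
    }

    /** Removes and returns the front Tweet, or null if empty (an assumption). */
    public Tweet dequeue() {
        if (head == null) {
            return null;
        }
        Tweet t = head.data;
        head = head.next;
        if (head == null) {
            tail = null;
        }
        count--;
        return t;
    }

    /** Prints every Tweet front to back without changing the queue. */
    public void printAll() {
        for (Node n = head; n != null; n = n.next) {
            System.out.println(n.data.toString());
        }
    }

    /** Returns how many Nodes are in the queue. */
    public int length() {
        return count;
    }
}
```

Keeping a counter makes length() constant time; walking the list and counting Nodes would also work, at the cost of a full traversal per call.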

You must alter your Step 1 code to use all of the above, and there should no longer be a Tweet[] array in use anywhere.

Expected Output

> java Search sometweets.txt
Queue size: 33
i kicked daniels knee  [st0rmcl0aks]  8/11/2013
rt @jvson_: i think it's horrible that people feel embarrassed to take btec now because of how it's mocked on here  [deebaybiie]  8/11/2013
@ally_b237 asehh..your bus has not yet touch down??  [wiz_ked]  8/12/2013
...
...
rt @coralynencs: @alexxmathias mdddddddddddddr  [alexxmathias]  8/12/2013

STOP: save your progress!

Create a subdirectory called step2/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.


Step 3 (15 pts): Keyword Filter

Now we add search functionality! The user will enter search words, and you'll create a new queue with only matching tweets. You will change main() to allow for user input, as in previous labs and homeworks. You will prompt the user with a question mark "? ", and the user will type single search query words. For instance:

> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? everyone
Queue size: 19
? tired
Queue size: 1
? !dump
RT @enahzxo_: I'm so tired of trying to make everyone else happy.  [_TimiciaAri]  8/16/2013
Queue size: 1
? !quit
Goodbye!

When the user enters a search word, you will create a brand new queue from your current queue. This new queue will contain only those tweets that contain the search keyword.
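
A minimal sketch of that idea is shown below. The method name filterForKeyword is hypothetical (modeled on the naming used later for filterForNotKeyword), and the case-insensitive substring match is an inference from the sample sessions in this handout rather than a stated requirement. Tweet and Queue are trimmed-down stand-ins.

```java
// Stand-in Tweet that stores only the text and knows how to test for a keyword.
class Tweet {
    private final String text;

    public Tweet(String newtext, String newuser, String newdate) {
        text = newtext;
    }

    /** True if the tweet text contains keyword, ignoring case (assumed rule). */
    public boolean contains(String keyword) {
        return text.toLowerCase().contains(keyword.toLowerCase());
    }
}

// Trimmed-down Queue with just enough to demonstrate filtering.
class Queue {
    private static class Node {
        Tweet data;
        Node next;

        Node(Tweet data) {
            this.data = data;
        }
    }

    private Node head;
    private Node tail;
    private int count;

    public void enqueue(Tweet t) {
        Node n = new Node(t);
        if (tail == null) {
            head = n;
        } else {
            tail.next = n;
        }
        tail = n;
        count++;
    }

    public int length() {
        return count;
    }

    /** Hypothetical name. Returns a NEW queue of matching tweets; this queue is untouched. */
    public Queue filterForKeyword(String keyword) {
        Queue result = new Queue();
        for (Node n = head; n != null; n = n.next) {
            if (n.data.contains(keyword)) {
                result.enqueue(n.data);   // share the Tweet object, but use a fresh Node
            }
        }
        return result;
    }
}
```

Since the loop only reads the Nodes and never dequeues, the queue being filtered keeps all of its tweets.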

We will also allow for one special user input, "!dump". If the user types this, print out all tweets in your current queue.

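To achieve this end, implement the following:

  1. Write a member method in Queue that builds and returns a new Queue containing only the tweets whose text contains the keyword. It must not alter the queue it is called on.
  2. Change main() to prompt with "? ", read one query per line, and handle the special inputs "!dump" and "!quit".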

Step 3 should result in this output exactly:

> java Search sometweets.txt
Queue size: 33
? you
Queue size: 7
? !dump
@ally_b237 asehh..your bus has not yet touch down??  [wiz_ked]  8/12/2013
@_xratedxbeauty lol how far is you?  [ayoo_imbadx]  8/12/2013
a dream is a wish your heart makes.  [emilyy_gant]  8/12/2013
rt @hayescrazed_xo: @hayniacs2327 you're welcome. you're welcome. you're welcome.  [hayniacs2327]  8/12/2013
@blackieechannn who do you have  [ashleyyymariek]  8/12/2013
you can only know the time you go to bed, but you can never know the time you sleep.. rt if u agree  [tolu1786]  8/12/2013
@stephanieirvine remember that time i converted you to fnl?  [wingster55]  8/12/2013
Queue size: 7
? dream
Queue size: 1
? !dump
a dream is a wish your heart makes.  [emilyy_gant]  8/12/2013
Queue size: 1
? !quit
Goodbye!

Now you're ready to try the big file:

> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? everyone
Queue size: 19
? tired
Queue size: 1
? !dump
RT @enahzxo_: I'm so tired of trying to make everyone else happy.  [_TimiciaAri]  8/16/2013
Queue size: 1
? !quit
Goodbye!

And one more for fun:

> java Search alltweets.txt
Queue size: 188671
? navy
Queue size: 42
? rihanna
Queue size: 10
? cheer
Queue size: 1
? !dump
RT @chaneIrihanna: cheer up navy #mtvhottest Rihanna  [Fentyisdahottes]  8/14/2013
Queue size: 1
? !quit
Goodbye!

STOP: save your progress!

Create a subdirectory called step3/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.


Step 4 (5 pts): Non-Keyword Filter

This step allows users to enter negated keywords. This means that they can enter a word such that you remove all tweets that actually contain the keyword. The end result is a list of tweets where none include the keyword.

Change your program to allow for this second type of query. The user input will be preceded by a minus sign, such as "-sad". You will then create a new queue from the current queue, but this time only keep those tweets that do not have the given word (e.g., tweets without "sad" in them).

  1. Write a member method in Queue:
    Queue filterForNotKeyword(String keyword)
  2. Change main() to allow for keywords that start with a minus sign: "-happy"

> java Search sometweets.txt
Queue size: 33
? the
Queue size: 8
? -a
Queue size: 1
? !dump
under the influence of music.  [zeandercarter]  8/12/2013
Queue size: 1
?
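
To see why the session above ends with one tweet, here is a small sketch of how a query with or without a leading minus sign could be applied to a single tweet's text. QueryDemo and keeps are hypothetical names, and the case-insensitive substring test is again an inference from the samples.

```java
// Hypothetical helper; in the project this logic would be spread across main() and Queue.
class QueryDemo {
    /** Decides whether a tweet's text survives a query such as "the" or "-a". */
    static boolean keeps(String text, String query) {
        boolean negated = query.startsWith("-");
        String word = negated ? query.substring(1) : query;   // drop the minus sign
        boolean found = text.toLowerCase().contains(word.toLowerCase());
        return negated ? !found : found;
    }

    public static void main(String[] args) {
        String tweet = "under the influence of music.";
        System.out.println(keeps(tweet, "the"));   // kept by the keyword filter
        System.out.println(keeps(tweet, "-a"));    // kept again: the text has no letter a
    }
}
```

Note that only the tweet text is tested: the username zeandercarter contains the letter a, yet the tweet still survives the "-a" query in the sample.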

Here is one on the big file:

> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? world
Queue size: 31
? -birthday
Queue size: 26
? :)
Queue size: 3
? !dump
Good morning :)"@ratihibrahim: Good morning world, good morning good people, good morning happy sunday...."  [murty_pane]  8/17/2013
RT @tcookin: Happy World Photography Day guys! Keep travelling and keep clicking! :)  [vikrant7985]  8/19/2013
@Real_Liam_Payne If you'restill online and you read this please follow me?:)x You would make me SOhappy<3ILY so muchx You guys are myworldxX  [MelissaTweets13]  8/20/2013
Queue size: 3

STOP: save your progress!

Create a subdirectory called step4/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.


Step 5 (5 pts): Date Filter

Now we will add a filter based on the day that the tweet was tweeted. The user can type in a date, and you must then create a new queue from the current queue that only keeps tweets that occurred on the given day. Days will be entered with a plus sign (+) in front of them. The format will be:

+year-month-day
For example:
+2014-1-28

To accomplish this, write a member method in Queue: Queue filterForDate(String date)
You will also need to add helpful code to your Tweet class to support this filter. Overall, this filter should behave like the previous steps, but this time it only keeps the Tweets that occurred on the given day. You'll obviously have to split the String date up in your method, and compare it to the Tweet object's int fields.
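
The splitting and comparing could be sketched as below. DateQuery, parse, and sameDay are hypothetical names, and whether the String passed to filterForDate still carries the leading plus sign is up to your main(); this sketch accepts both forms.

```java
// Hypothetical helper showing the date comparison on plain ints.
class DateQuery {
    /** Parses "+2013-8-17" or "2013-8-17" into {2013, 8, 17}. */
    static int[] parse(String query) {
        String date = query.startsWith("+") ? query.substring(1) : query;
        String[] parts = date.split("-");
        return new int[] {
            Integer.parseInt(parts[0]),
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2])
        };
    }

    /** True when a tweet's stored year, month, and day equal the requested day. */
    static boolean sameDay(int year, int month, int day, int[] wanted) {
        return year == wanted[0] && month == wanted[1] && day == wanted[2];
    }

    public static void main(String[] args) {
        int[] wanted = parse("+2013-8-11");
        System.out.println(sameDay(2013, 8, 11, wanted));   // a tweet from that day
        System.out.println(sameDay(2013, 8, 12, wanted));   // a tweet from the next day
    }
}
```

Comparing ints rather than Strings means a query written as +2013-08-11 matches the same tweets as +2013-8-11.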

> java Search sometweets.txt
Queue size: 33
? +2013-8-11
Queue size: 2
?

> java Search alltweets.txt
Queue size: 188671
? omg
Queue size: 1089
? +2013-8-17
Queue size: 102
?

STOP: save your progress!

Create a subdirectory called step5/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.


Step 6 (5 pts): Reset

Finally, add a !reset option. The !reset option lets the user start back at the original queue. Ignore everything that has been searched so far, and begin over again. Do NOT re-read the file from disk. You should always keep the original queue around, and none of your methods should have modified it if you did the above steps correctly.
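
The idea can be sketched with two references: one that always points at the original queue and one that points at the current search result. Everything below is a stand-in for illustration: the Queue holds Strings instead of Tweets, and Session, search, and reset are hypothetical names.

```java
// Stand-in queue of Strings, just enough to show the reset idea.
class Queue {
    private static class Node {
        String data;
        Node next;

        Node(String data) {
            this.data = data;
        }
    }

    private Node head;
    private Node tail;
    private int count;

    void enqueue(String s) {
        Node n = new Node(s);
        if (tail == null) {
            head = n;
        } else {
            tail.next = n;
        }
        tail = n;
        count++;
    }

    int length() {
        return count;
    }

    /** Returns a new queue; this one is never modified. */
    Queue filterForKeyword(String keyword) {
        Queue result = new Queue();
        for (Node n = head; n != null; n = n.next) {
            if (n.data.contains(keyword)) {
                result.enqueue(n.data);
            }
        }
        return result;
    }
}

// Hypothetical holder for the two references that main() would keep.
class Session {
    private final Queue original;   // filled once from the file, never reassigned
    private Queue current;          // narrows with every query

    Session(Queue all) {
        original = all;
        current = all;
    }

    void search(String keyword) {
        current = current.filterForKeyword(keyword);
    }

    void reset() {
        current = original;         // no file I/O, just point back at the full queue
    }

    int size() {
        return current.length();
    }
}
```

This only works because every filter builds a new queue; if a filter dequeued from the queue it was called on, the original would be damaged and reset would have nothing to return to.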

> java Search alltweets.txt
Queue size: 188671
? army
Queue size: 103
? !reset
Queue size: 188671
? navy
Queue size: 42
? !quit
Goodbye!

STOP: save your progress!

Create a subdirectory called step6/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.


Grading

OOP Principles: 40%
Functionality: 40%
Style and Comments (javadoc): 20%

If your program is fully functional, but it's all in one big class, the maximum you can receive is 60%.

We will grade only the farthest part that is working in full. Partial credit will not be given for incomplete parts.

Submit and Review

Due date: 2359 on 10 Feb 2020

Have you commented your code? Does every method have comments before it? Are you following the javadoc specs? Is your code consistently and uniformly indented?

Keep all of your subdirectories stepN/ in place.

Finally, submit the farthest step that you got working. Only submit your .java files. Change into that step's subdirectory and run:

~/bin/submit -c=IC211 -p=project01 *.java

Log into the submission site (submit.cs.usna.edu) and review the test cases run against your submission.