Automated ways of searching text are now used across the Web, within institutions, and across critical services like the healthcare industry. This project will have you write a basic search tool to filter large amounts of textual data in a fraction of a second. It would take a human days (weeks, months) to do the same thing by hand.
We are providing you with almost 200 thousand actual public tweets from Twitter. Your program will read them and provide a search interface. Dr. Chambers has a continual feed into Twitter that downloads millions of tweets every day, so this is just a fraction of a fraction of the data that we have here in the CS department. We use it to do fundamental research in artificial intelligence and information extraction, such as tracking international relations or DDoS attacks on networks. This short project will let you poke around the data.
Due date: 2359 on 10 Feb 2020
Honor: see course policy. You may not help anybody with, or discuss with anybody, any aspect of this project (except your instructor and MGSP).
Disclaimer
You may not distribute this project's data to anyone beyond the USNA. Our agreement with Twitter prevents copying and distributing.
This is raw, real-world data. The standard disclaimer applies as it does whenever we step out onto the Web. You may come across offensive material. Please behave like mature adults and future officers, as appropriate.
Input Data
Input to your project's code will come from a single text file. The file contains one tweet per line. Each line contains a single tweet's 3 data fields (tweet, username, date) separated by tab ('\t') characters. You must read this file, create Tweet objects, and store them in an array. Here is an example of a line in the file:
omg i can't believe you said that!!! #whoknows stargazerz 2013-03-20
Do this now: create a Project1/ directory for your code.
Download the two tweet data files into your new directory: all tweets (15 MB) and some tweets (2 KB)
Peek inside and take a look at the data.
Create a Tweet class.
You must write a class definition for a Tweet that appropriately stores the different pieces of information like the text, username, year, month, and day. Do not store the entire line as one String. You must follow proper Data Hiding techniques.
Make a constructor that uses this exact prototype definition:
public Tweet(String newtext, String newuser, String newdate) {}
Remember that the constructor's sole job is to initialize the variables in the Tweet object. You will likely need to split that String date into three int values. How do we do that in Java? See the String .split() method to break it up into 3 parts. This will result in a String for the year like "2013". To convert this to an int, Java provides a static method that you can call:
Integer.parseInt("2013") ==> 2013
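As a rough illustration of the split-then-parse idea described above, here is a minimal sketch. The class name DateParts and the method parse() are hypothetical names chosen for this example; they are not part of the assignment, and the fields are left non-private only to keep the sketch short.

```java
// Sketch only: DateParts and parse() are hypothetical names, not part of the assignment.
final class DateParts {
    final int year;
    final int month;
    final int day;

    DateParts(int year, int month, int day) {
        this.year = year;
        this.month = month;
        this.day = day;
    }

    /** Splits a "year-month-day" string on '-' and converts each piece to an int. */
    static DateParts parse(String date) {
        String[] parts = date.split("-");    // "2013-03-20" -> {"2013", "03", "20"}
        return new DateParts(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),  // "03" parses to 3; the leading zero is dropped
                Integer.parseInt(parts[2]));
    }

    public static void main(String[] args) {
        DateParts d = parse("2013-03-20");
        System.out.println(d.month + "/" + d.day + "/" + d.year);  // prints 3/20/2013
    }
}
```

Note that Integer.parseInt throws a NumberFormatException on non-numeric input, so this sketch assumes every date in the file is well formed.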
Finally, create a member method in Tweet: String toString(). This method returns a single String that represents all of the Tweet's data. The method does not print anything; it simply builds and returns a String. What should the string look like? Follow this format (use a tab character '\t' to separate the 3 output fields):
tweet text goes first [usernameInBrackets] 1/30/2013
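The format above can be sketched as follows. TweetSketch is a hypothetical stand-in class written for this illustration, not a reference solution; it assumes the date always arrives as "year-month-day" and only shows how private fields and tab-separated string building fit together.

```java
// Sketch only: TweetSketch is a hypothetical stand-in, not a reference solution.
final class TweetSketch {
    // Private fields: callers cannot read or change them directly.
    private final String text;
    private final String user;
    private final int year;
    private final int month;
    private final int day;

    TweetSketch(String newtext, String newuser, String newdate) {
        text = newtext;
        user = newuser;
        String[] parts = newdate.split("-");  // assumes "year-month-day"
        year = Integer.parseInt(parts[0]);
        month = Integer.parseInt(parts[1]);
        day = Integer.parseInt(parts[2]);
    }

    /** Builds "text<TAB>[user]<TAB>month/day/year" and returns it without printing. */
    @Override
    public String toString() {
        return text + "\t[" + user + "]\t" + month + "/" + day + "/" + year;
    }

    public static void main(String[] args) {
        System.out.println(new TweetSketch("tweet text goes first", "usernameInBrackets", "2013-01-30"));
    }
}
```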
Create your main program.
Create Search.java with a main() method. Your main method will get the file path from the command line, and then call another method:
Tweet[] readFile(String path) {}
You must write readFile. Where is the proper place to define this method?
To return an array, you obviously need to initialize it to a certain size.
We will hardcode the size for this step: size 33 (this is how many tweets are in the file sometweets.txt).
Then write code to open the file, read its lines, and fill the Tweet[] array.
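One possible shape for the file-reading loop is sketched below. Everything named here (ReadFileSketch, Row, and the extra size parameter) is an assumption made so the example compiles on its own; the assignment's own prototype takes only the path and returns Tweet[]. The sketch uses BufferedReader, though a Scanner would work equally well.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch only: ReadFileSketch and Row are hypothetical stand-ins for Search and Tweet.
final class ReadFileSketch {
    /** Minimal holder for the three tab-separated fields of one line. */
    static final class Row {
        final String text;
        final String user;
        final String date;

        Row(String text, String user, String date) {
            this.text = text;
            this.user = user;
            this.date = date;
        }
    }

    /** Reads up to 'size' lines; slots stay null if the file is shorter than 'size'. */
    static Row[] readFile(String path, int size) throws IOException {
        Row[] rows = new Row[size];
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            int i = 0;
            while (i < size && (line = in.readLine()) != null) {
                String[] fields = line.split("\t");  // tweet, username, date
                if (fields.length < 3) {
                    continue;                        // skip malformed lines
                }
                rows[i++] = new Row(fields[0], fields[1], fields[2]);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java Search <tweets-file>");
            return;
        }
        Row[] rows = readFile(args[0], 33);          // 33 is the hardcoded size from this step
        System.out.println("Array size: " + rows.length);
    }
}
```

The try-with-resources statement closes the file even if reading fails partway through.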
Expected Output
Make sure your running program looks exactly like the following. Note that the date format is different from the input. The year is listed last, and all fields are separated by slashes:
> java Search
usage: java Search <tweets-file>

> java Search sometweets.txt
Array size: 33
i kicked daniels knee [st0rmcl0aks] 8/11/2013
rt @jvson_: i think it's horrible that people feel embarrassed to take btec now because of how it's mocked on here [deebaybiie] 8/11/2013
@ally_b237 asehh..your bus has not yet touch down?? [wiz_ked] 8/12/2013
...
...
rt @coralynencs: @alexxmathias mdddddddddddddr [alexxmathias] 8/12/2013
STOP: save your progress!
At the end of this step, you should have (at a minimum) Search.java and Tweet.java.
Create a subdirectory called step1/ inside your current Project1/ directory. Copy all .java files into it, and nothing else. You will do this at each step.
The previous step reads the file into an array of known size. This is not ideal because we want to read files of arbitrary size. It is also not ideal because arrays are difficult to resize and expensive to duplicate. This step replaces the Tweet array with a Queue. The output will be similar to Step 1, but more flexible because a file of any size can now be read.
Reuse Your Queue and Node from the HW.
Create a Queue to store those Tweet objects. You should use your Queue HW's code! The HW's Queue used Nodes, and the Nodes contained a single String data. The only thing different here is that we don't have Strings but instead Tweets.
To be clear, you need two new classes for this step:
As in the HW, you may keep Node as an inner class within Queue. You could alternatively split it out into its own Node.java if you prefer. Your task is then the following:
After you've updated your Queue to work with Tweets, add two more methods that will be useful in your main program:
You must alter your Step 1 code to use all of the above, and there should no longer be a Tweet[] array in use anywhere.
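For orientation, a node-based queue might look roughly like the sketch below. The HW's actual Queue interface is not shown here, so the class name LinkedQueueSketch and the methods enqueue, dequeue, size, and isEmpty are assumptions. The sketch uses a type parameter only so that it compiles without a Tweet class; in the project, the node's data field would simply be a Tweet.

```java
import java.util.NoSuchElementException;

// Sketch only: names and methods are assumptions, not the HW's actual Queue API.
final class LinkedQueueSketch<T> {
    /** One link in the chain: a data item plus a pointer to the next node. */
    private class Node {
        final T data;
        Node next;

        Node(T data) {
            this.data = data;
        }
    }

    private Node front;  // oldest item, removed first
    private Node back;   // newest item, added last
    private int size;

    /** Appends an item at the back in constant time. */
    void enqueue(T item) {
        Node n = new Node(item);
        if (back == null) {
            front = n;       // first item: front and back are the same node
        } else {
            back.next = n;
        }
        back = n;
        size++;
    }

    /** Removes and returns the item at the front. */
    T dequeue() {
        if (front == null) {
            throw new NoSuchElementException("queue is empty");
        }
        T item = front.data;
        front = front.next;
        if (front == null) {
            back = null;     // queue became empty
        }
        size--;
        return item;
    }

    int size() {
        return size;
    }

    boolean isEmpty() {
        return size == 0;
    }
}
```

Unlike the fixed-size array from Step 1, this structure grows one node at a time, so the number of lines in the file does not need to be known in advance.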
Expected Output.
> java Search sometweets.txt
Queue size: 33
i kicked daniels knee [st0rmcl0aks] 8/11/2013
rt @jvson_: i think it's horrible that people feel embarrassed to take btec now because of how it's mocked on here [deebaybiie] 8/11/2013
@ally_b237 asehh..your bus has not yet touch down?? [wiz_ked] 8/12/2013
...
...
rt @coralynencs: @alexxmathias mdddddddddddddr [alexxmathias] 8/12/2013
STOP: save your progress!
Create a subdirectory called step2/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.
Now we add search functionality! The user will enter search words, and you'll create a new queue with only matching tweets. You will change main() to allow for user input, as in previous labs and homeworks. You will prompt the user with a question mark "? ", and the user will type single search query words. For instance:
> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? everyone
Queue size: 19
? tired
Queue size: 1
? !dump
RT @enahzxo_: I'm so tired of trying to make everyone else happy. [_TimiciaAri] 8/16/2013
Queue size: 1
? !quit
Goodbye!
When the user enters a search word, you will create a brand new queue from your current queue. This new queue will contain only those tweets that contain the search keyword.
We will also allow for one special user input, "!dump". If the user types this, print out all tweets in your current queue.
To achieve this end, implement the following:
boolean containsKeyword(String keyword)
This returns true if the Tweet's text contains the given word. We are considering a text to contain a word if that word appears as a substring of the entire tweet. In other words, the tweet "yes there are two" should return true if our search keyword is "the". You will find the String method contains(String) helpful. See the javadocs for details.
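The substring rule can be checked with a tiny sketch. Here the test is written as a static helper taking the text as a parameter purely so the example stands alone; KeywordMatchSketch is a hypothetical name, and in the project the method is a member of Tweet with a single keyword parameter.

```java
// Sketch only: a static stand-in for the Tweet member method described above.
final class KeywordMatchSketch {
    /** True if 'keyword' appears anywhere inside 'text', even in the middle of a word. */
    static boolean containsKeyword(String text, String keyword) {
        return text.contains(keyword);
    }

    public static void main(String[] args) {
        System.out.println(containsKeyword("yes there are two", "the"));  // prints true ("the" is inside "there")
        System.out.println(containsKeyword("yes there are two", "The"));  // prints false (contains is case-sensitive)
    }
}
```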
Queue filterForKeyword(String keyword)
This should return a new queue that contains all matching tweets.
IMPORTANT: your original Queue object should not be changed!
Think about how to traverse the queue without moving the front/back pointers.
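One way to traverse without touching the front/back pointers is to walk the chain with a separate local cursor, as in the sketch below. FilterQueueSketch is a hypothetical class that stores plain Strings in place of Tweets so the example is self-contained; it illustrates the traversal idea and is not a reference solution.

```java
// Sketch only: a String queue standing in for the project's Queue of Tweets.
final class FilterQueueSketch {
    private static final class Node {
        final String data;
        Node next;

        Node(String data) {
            this.data = data;
        }
    }

    private Node front;
    private Node back;
    private int size;

    void enqueue(String s) {
        Node n = new Node(s);
        if (back == null) {
            front = n;
        } else {
            back.next = n;
        }
        back = n;
        size++;
    }

    int size() {
        return size;
    }

    /** Returns a new queue of matching items; this queue is only read, never modified. */
    FilterQueueSketch filterForKeyword(String keyword) {
        FilterQueueSketch result = new FilterQueueSketch();
        // 'cur' is a local cursor: advancing it does not move front or back.
        for (Node cur = front; cur != null; cur = cur.next) {
            if (cur.data.contains(keyword)) {
                result.enqueue(cur.data);
            }
        }
        return result;
    }
}
```

Because nothing is dequeued, the original queue has the same size and order after the call, and matches appear in the new queue in their original order.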
Step 3 should result in this output exactly:
> java Search sometweets.txt
Queue size: 33
? you
Queue size: 7
? !dump
@ally_b237 asehh..your bus has not yet touch down?? [wiz_ked] 8/12/2013
@_xratedxbeauty lol how far is you? [ayoo_imbadx] 8/12/2013
a dream is a wish your heart makes. [emilyy_gant] 8/12/2013
rt @hayescrazed_xo: @hayniacs2327 you're welcome. you're welcome. you're welcome. [hayniacs2327] 8/12/2013
@blackieechannn who do you have [ashleyyymariek] 8/12/2013
you can only know the time you go to bed, but you can never know the time you sleep.. rt if u agree [tolu1786] 8/12/2013
@stephanieirvine remember that time i converted you to fnl? [wingster55] 8/12/2013
Queue size: 7
? dream
Queue size: 1
? !dump
a dream is a wish your heart makes. [emilyy_gant] 8/12/2013
Queue size: 1
? !quit
Goodbye!
Now you're ready to try the big file:
> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? everyone
Queue size: 19
? tired
Queue size: 1
? !dump
RT @enahzxo_: I'm so tired of trying to make everyone else happy. [_TimiciaAri] 8/16/2013
Queue size: 1
? !quit
Goodbye!
And one more for fun:
> java Search alltweets.txt
Queue size: 188671
? navy
Queue size: 42
? rihanna
Queue size: 10
? cheer
Queue size: 1
? !dump
RT @chaneIrihanna: cheer up navy #mtvhottest Rihanna [Fentyisdahottes] 8/14/2013
Queue size: 1
? !quit
Goodbye!
STOP: save your progress!
Create a subdirectory called step3/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.
This step allows users to enter negated keywords. That is, the user can enter a word and you remove every tweet that contains it. The end result is a list of tweets in which none include the keyword.
Change your program to allow for this second type of query. The user input will be preceded by a minus sign, such as "-sad". You will then create a new queue from the current queue, but this time only keep those tweets that do not have the given word (e.g., tweets without "sad" in them).
Queue filterForNotKeyword(String keyword)
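The negated filter is the mirror image of the keyword filter: keep an item only when the substring test fails. The sketch below uses java.util.ArrayDeque of Strings purely as a stand-in for the project's own Queue of Tweets (the project itself should use your Queue class), and the helper isNegated is a hypothetical name for recognizing the minus-sign prefix.

```java
import java.util.ArrayDeque;

// Sketch only: ArrayDeque<String> stands in for the project's own Queue of Tweets.
final class NegatedFilterSketch {
    /** True for inputs like "-sad"; a lone "-" carries no keyword, so it is not treated as negated. */
    static boolean isNegated(String input) {
        return input.length() > 1 && input.startsWith("-");
    }

    /** Returns a new deque holding only the items that do NOT contain the keyword. */
    static ArrayDeque<String> filterForNotKeyword(ArrayDeque<String> current, String keyword) {
        ArrayDeque<String> kept = new ArrayDeque<>();
        for (String text : current) {        // iterates front to back without removing anything
            if (!text.contains(keyword)) {
                kept.add(text);              // add() appends at the back, preserving order
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        ArrayDeque<String> current = new ArrayDeque<>();
        current.add("under the influence");
        current.add("the cat sat");
        String input = "-a";
        if (isNegated(input)) {
            String keyword = input.substring(1);  // strip the leading minus sign
            System.out.println("Queue size: " + filterForNotKeyword(current, keyword).size());
        }
    }
}
```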
> java Search sometweets.txt
Queue size: 33
? the
Queue size: 8
? -a
Queue size: 1
? !dump
under the influence of music. [zeandercarter] 8/12/2013
Queue size: 1
?
Here is one on the big file:
> java Search alltweets.txt
Queue size: 188671
? happy
Queue size: 1647
? world
Queue size: 31
? -birthday
Queue size: 26
? :)
Queue size: 3
? !dump
Good morning :)"@ratihibrahim: Good morning world, good morning good people, good morning happy sunday...." [murty_pane] 8/17/2013
RT @tcookin: Happy World Photography Day guys! Keep travelling and keep clicking! :) [vikrant7985] 8/19/2013
@Real_Liam_Payne If you'restill online and you read this please follow me?:)x You would make me SOhappy<3ILY so muchx You guys are myworldxX [MelissaTweets13] 8/20/2013
Queue size: 3
STOP: save your progress!
Create a subdirectory called step4/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.
Now we will add a filter based on the day that the tweet was posted. The user can type in a date, and you must then create a new queue from the current queue that only keeps tweets that occurred on the given day. Days will be entered with a plus sign (+) in front of them. The format will be:
+year-month-day
For example:
+2014-1-28
To accomplish this, write a member method in Queue: Queue filterForDate(String date)
You will also need to add helpful code to your Tweet class to support this filter. Overall, this filter should behave like the previous steps, but this time it only keeps the Tweets that occurred on the given day. You'll obviously have to split the String date up in your method, and compare it to the Tweet object's int fields.
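The comparison described above might be sketched like this. DateMatchSketch and matchesDate are hypothetical names; in the project the year, month, and day would be a Tweet's private int fields rather than parameters. Whether the string handed to filterForDate still carries the leading plus sign is not specified, so this sketch accepts both forms.

```java
// Sketch only: a static stand-in for date-matching support code in Tweet.
final class DateMatchSketch {
    /**
     * Compares stored int fields against a query such as "+2014-1-28" or "2014-1-28".
     * Non-numeric pieces would make Integer.parseInt throw NumberFormatException.
     */
    static boolean matchesDate(int year, int month, int day, String query) {
        String date = query.startsWith("+") ? query.substring(1) : query;
        String[] parts = date.split("-");
        if (parts.length != 3) {
            return false;  // not in year-month-day form
        }
        return Integer.parseInt(parts[0]) == year
                && Integer.parseInt(parts[1]) == month
                && Integer.parseInt(parts[2]) == day;
    }

    public static void main(String[] args) {
        System.out.println(matchesDate(2013, 8, 11, "+2013-8-11"));  // prints true
        System.out.println(matchesDate(2013, 8, 11, "+2013-8-12"));  // prints false
    }
}
```

Comparing ints rather than strings means "+2013-08-11" and "+2013-8-11" are treated as the same day.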
> java Search sometweets.txt
Queue size: 33
? +2013-8-11
Queue size: 2
?

> java Search alltweets.txt
Queue size: 188671
? omg
Queue size: 1089
? +2013-8-17
Queue size: 102
?
STOP: save your progress!
Create a subdirectory called step5/ inside your current Project1/ directory. Copy all .java files into it, and nothing else.
Finally, add a !reset option. The !reset option lets the user start back at the original queue. Ignore everything that has been searched so far, and begin over again. Do NOT re-read the file from disk. You should always keep the original queue around, and none of your methods should have modified it if you did the above steps correctly.
> java Search alltweets.txt
Queue size: 188671
? army
Queue size: 103
? !reset
Queue size: 188671
? navy
Queue size: 42
? !quit
Goodbye!
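The reset behavior falls out naturally if the program keeps two references: one to the original collection, which is never reassigned, and one to the current results. The sketch below illustrates that idea with java.util.List of Strings standing in for the project's Queue of Tweets; ResetSketch, run, and filter are hypothetical names, and the input is a scripted string rather than the keyboard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Sketch only: List<String> stands in for the project's own Queue of Tweets.
final class ResetSketch {
    /** Non-destructive filter: builds a new list and leaves 'from' untouched. */
    static List<String> filter(List<String> from, String keyword) {
        List<String> kept = new ArrayList<>();
        for (String text : from) {
            if (text.contains(keyword)) {
                kept.add(text);
            }
        }
        return kept;
    }

    /** Runs the query loop and records the size after each step (including the start). */
    static List<Integer> run(List<String> original, Scanner in) {
        List<Integer> sizes = new ArrayList<>();
        List<String> current = original;       // 'original' itself is never reassigned
        sizes.add(current.size());
        while (in.hasNextLine()) {
            String query = in.nextLine().trim();
            if (query.equals("!quit")) {
                break;
            }
            if (query.equals("!reset")) {
                current = original;            // no file re-read: just point back at the original
            } else {
                current = filter(current, query);
            }
            sizes.add(current.size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("go army", "go navy", "navy blue");
        Scanner scripted = new Scanner("army\n!reset\nnavy\n!quit\n");
        for (int size : run(original, scripted)) {
            System.out.println("Queue size: " + size);  // sizes recorded: 3, 1, 3, 2
        }
    }
}
```

This only works because every filter returns a new collection; if any filter modified its input, the original would be damaged and the reset would have nothing to restore.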
STOP: save your progress!
Create a subdirectory called step6/ inside your current Project1/ directory. Move all .java files into it, and nothing else.
OOP Principles: 40%
Functionality: 40%
Style and Comments (javadoc): 20%
If your program is fully functional, but it's all in one big class, the maximum you can receive is 60%.
We will grade only the farthest part that is working in full. Partial credit will not be given for incomplete parts.
Have you commented your code? Does every method have comments before it? Are you following the javadoc specs? Is your code consistently and uniformly indented?
Keep all of your subdirectories stepN/ in place.
Finally, submit the farthest step that you got working. Only submit your .java files. Change into that subdirectory and run:
~/bin/submit -c=IC211 -p=project01 *.java
Log into the submission site: submit.cs.usna.edu and review the test cases run against your submission.