Assignment 1: Hash Maps And Tweets (72 Points)

Chris Tralie

Overview / Logistics

The purpose of this assignment is to implement one of the most important data structures in all of CS: a hash map, which is simply an array of singly-linked lists. This will enable us to implement two abstract data types (ADTs): the Set and the Map (also known as a dictionary in python), both of which have tons of applications. Before you proceed, review the notes at this link, which cover the background for hash tables.

Learning Objectives

  • Implement a singly-linked list in python
  • Implement hash tables using pythonic object oriented paradigms
  • Use hash tables to fulfill the set and map ADTs, including exception handling
  • Apply the dictionary API to natural language processing

Data / Starter Code

Click here to download the starter code for this assignment. Actually, the starter code is quite minimal; it mainly has a dataset you'll be using in the last task.

What To Submit

Submit your files wizard.py and hash.py to Canvas, as well as any other .py files and notebooks you made. Also submit answers to the following questions on Canvas:

  1. The name of your buddy, if you chose to work with one.
  2. Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write about it. The more you articulate the problem the more partial credit you will receive (fine to leave this blank)

Programming Tasks

To explain the data structures you'll be implementing, let's first look at python's built-in versions of them. First, we'll examine a set, which is an unordered collection of objects without repetition

This prints

Notice how the numbers are not in the same order we put them in (since the set order is arbitrary), and that even though we added 2 twice, it only showed up once (no repetitions).
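As a minimal sketch of the behavior just described (the specific numbers are chosen here purely for illustration), we can add a few integers to a built-in set, including one duplicate:

```python
# Illustrative values only; any hashable objects would do
s = set()
for x in [10, 2, 7, 2]:
    s.add(x)  # Adding 2 a second time has no effect

print(s)       # Iteration order is not guaranteed to match insertion order
print(len(s))  # 3, since the duplicate 2 was stored only once
print(2 in s)  # True
```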

Actually, the set is quite flexible, and we can add anything to it, as long as it has a hash code. For example, we can add

and then we'll see

But then if we try to do

then we get

This is because a list can change after it's created. One of the properties of hash codes is that they remain fixed for as long as an object is stored in the table, so python only provides them for "immutable" built-in objects, and a list doesn't get one.
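Here is a small sketch contrasting an immutable tuple with a mutable list holding the same values (the values themselves are arbitrary):

```python
s = set()
s.add((1, 2, 3))       # A tuple is immutable, so it has a hash code

error_message = None
try:
    s.add([1, 2, 3])   # A list is mutable, so python refuses to hash it
except TypeError as e:
    error_message = str(e)
    print("TypeError:", error_message)
```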

Part 1: Hash Codes And Object Equality in Python (6 Points)

Let's now consider a more interesting class that's been provided with the assignment. I've created a Wizard class in wizard.py that mirrors the Harry Potter hashing class exercise. Let's run the following code from within some .py file or notebook in the same directory as wizard.py:

but then this prints out

It seems like the set's not working, because it's not supposed to have more than one copy of the same object! But let's drill down and look at the hash codes. We can do this by calling the built-in method hash in python:

When I run this, I get the following on my computer (but you will get something different every time you run this, which will become clear in a moment):

What on earth are these numbers, and how did python come up with them? Let's try one more thing. If we run the built-in id() method, it will tell us the memory address of the object passed to it:

which yields the following

As it turns out, since we haven't told python how to make a hash code for a wizard, it ends up using the memory address // 16.

However, even so, there are some collisions here: the only unique hash codes are 8785000196838 and 8785000196229. So why are we still seeing 5 unique objects in the set? Well, python doesn't actually know that they're equal; if we don't tell python how to compare objects, it will default to comparing them by their memory address. Since each of the 5 objects has a different memory address, they are considered unique.
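To see this default behavior in isolation, here is a sketch with a hypothetical Pet class (not part of the starter code) that defines neither __hash__ nor __eq__:

```python
class Pet:
    """Hypothetical class with no __hash__ or __eq__ of its own"""
    def __init__(self, name):
        self.name = name

# Five separate objects that all hold the same name
pets = [Pet("Hedwig") for _ in range(5)]
s = set(pets)

print(len(s))              # 5: each object lives at its own memory address
print(pets[0] == pets[1])  # False: default equality compares identity
print(pets[0] == pets[0])  # True: an object is always equal to itself
```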

Your Task

Write code in wizard.py to override the default behavior of object equality and hashing for the Wizard class:

  1. Implement the __hash__ method, which takes no parameters (other than self), and which returns month*31 + day
  2. Implement the __eq__ method, which takes one parameter (in addition to self): the object to compare against. Return True if the name, month, day, and year are equal between this object and the other object, and False otherwise.

If this works properly, the hash codes for all of the Snape wizards above should be 83, and when you run

It should print True. Finally, you should only see one Snape wizard in the set if you add them in the loop.
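The general pattern looks something like the following sketch, shown on a hypothetical Card class rather than on Wizard. The class, its fields, and its hash formula are all made up for illustration; the only rule being demonstrated is that two objects that compare equal must return the same hash code.

```python
class Card:
    """Hypothetical example class (not part of the starter code)"""
    def __init__(self, rank, suit):
        self.rank = rank
        self.suit = suit

    def __hash__(self):
        # Objects that are equal must return the same hash code
        return self.rank * 4 + self.suit

    def __eq__(self, other):
        return self.rank == other.rank and self.suit == other.suit

s = set()
for _ in range(5):
    s.add(Card(12, 3))  # Five separate but equal objects

print(len(s))                      # 1
print(Card(12, 3) == Card(12, 3))  # True
print(hash(Card(12, 3)))           # 51
```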


Part 2: HashSet (24 Points)

Now that we've convinced ourselves that the built-in python set class works properly, let's implement our own version of a set from scratch using a hash table.

Your Task

Create a singly-linked list class and/or linked node class that wraps around a key object, which is what will be stored in the set. Create a class HashSet that aggregates an array of linked lists, and which has the following instance methods (that all have self as a first parameter)

  • (4 Points) __init__: The constructor should take one parameter specifying the number of buckets to use, and you should initialize an array of empty linked lists with that number of buckets

  • (4 Points) __contains__: A method that takes one parameter, the key, and which returns True if the key is in the set (that is, in the linked list for its bucket) and False otherwise. When you've finished implementing this method, then saying

    Should print out True. Note the "syntactic sugar" with the in keyword, which you can use to make your code look nicer. Under the hood, saying x in y in python is equivalent to calling y.__contains__(x)

  • (4 Points) add: Takes one parameter and adds that object to the set. You should compute the hash code of the object, find its appropriate bucket index, and add it if it doesn't already exist (we don't want duplicate objects).

  • (4 Points) remove(obj): Remove obj from the hash table if it exists in the table, or raise a KeyError if the object isn't in the table. You can do this with the following code:

  • (4 Points) __len__: A method with no parameters (other than self) which returns how many elements are in the set. This should run in constant time; you should not have to loop through all of the linked lists to count. Instead, store a member variable that keeps track of the number of objects as they are added and deleted so you can simply return that member variable.

    Note also the syntactic sugar; if we say len(s), this is equivalent to saying s.__len__()

  • (4 Points) keys: A method with no parameters that returns all of the elements in the set as a python list.
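To illustrate how python maps the in keyword and len() onto special methods, and how a missing element can be reported with a KeyError, here is a sketch using a hypothetical ListBag class. It is deliberately backed by a plain python list with a linear scan, so it is not a hash table; only the method names and the exception handling carry over to HashSet.

```python
class ListBag:
    """Hypothetical list-backed collection; NOT a hash table, just a demo
    of how python maps operators onto special methods"""
    def __init__(self):
        self._items = []

    def __contains__(self, key):
        return any(key == item for item in self._items)  # Linear scan

    def __len__(self):
        return len(self._items)

    def add(self, key):
        if key not in self:      # Calls self.__contains__(key)
            self._items.append(key)

    def remove(self, key):
        if key not in self:
            raise KeyError(key)  # Same exception the built-in set raises
        self._items.remove(key)

b = ListBag()
b.add("a")
b.add("a")       # Duplicate, so it is ignored
print("a" in b)  # True
print(len(b))    # 1

missing_raised = False
try:
    b.remove("z")
except KeyError as e:
    missing_raised = True
    print("KeyError:", e)
```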

If you're not sure where to get started, have a look at the needle in a haystack notes, as well as some starter code for linked lists that we made in class.

Tips

The first thing you should do in the __contains__, add, and remove methods is to compute the hash code for the object that's being passed in and % it by the number of buckets to figure out which bucket you should start looking in. We will assume here that the objects passed along have implemented __hash__, so that calling hash(obj) will give the hash code. Python already has built-in hash codes for ints, strings, floats, etc.
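As a quick sketch of that first step (the objects and the bucket count of 10 are arbitrary choices for illustration):

```python
num_buckets = 10
objs = [42, -7, 3.5, "Hermione", (1, 2)]

# hash(obj) may be negative, but python's % with a positive modulus always
# lands in the range 0 to num_buckets - 1, so it is a valid array index
indices = [hash(obj) % num_buckets for obj in objs]

for obj, idx in zip(objs, indices):
    print(repr(obj), "-> bucket", idx)
```

Note that, by default, python randomizes string hash codes each time the interpreter starts, so the bucket for "Hermione" can change from run to run, while the buckets for small ints stay the same.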

If this is all working properly, then the following code:

should print something like this (note that the order is arbitrary):


Part 3: HashMap (17 Points)

Sometimes, we may not know everything about a wizard, but we do know partial information like their name. In this case, it's more appropriate to use a map ADT instead of a set ADT to look up more info from the partial info we have. In this ADT, the key is something hashable that we look up in the hash table, but it is glued to something called a value, which is the associated thing we're trying to look up. Python calls its implementation of a map a dictionary (in Java it's a Map, in C++ STL it's a map, in PHP it's called an "associative array", ...). Here's how we might set up some information in a dictionary:

after which we'll see the following output

Note that a python dictionary has a quite convenient syntax; it looks just like an array, but we index it with keys instead of 0-indexed integers. We will use built-in python methods to replicate this syntactic sugar in our implementation of HashMap
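As a brief sketch of that syntax on the built-in dictionary (the keys and values here are chosen only for illustration):

```python
m = {}
m["Harry Potter"] = "Gryffindor"           # Add a key/value pair
m["Severus Snape"] = "Slytherin"
m["Harry Potter"] = "Gryffindor (Seeker)"  # Existing key: value is replaced

print(m["Severus Snape"])   # Slytherin
print("Draco Malfoy" in m)  # False
del m["Severus Snape"]      # Remove a key/value pair
print(len(m))               # 1
```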

Your task

Create a class HashMap which, like the HashSet class, aggregates an array of linked lists. You'll just need to tweak the node class in your linked list to hold a value in addition to a key. For those who had CS 174 with me, it's similar to the last assignment I gave in that class. Below are the instance methods you'll need to implement:

  • (2 Points) __init__: The constructor should take one parameter specifying the number of buckets to use, and you should initialize an array of empty linked lists with that number of buckets

  • (2 Points) __contains__: A method that takes one parameter, the key, and which returns True if the key is in the map (that is, in the linked list for its bucket) and False otherwise. This should be incredibly similar to the __contains__ method in HashSet that you did before.

  • (3 Points) __setitem__: A method that takes two parameters, the key and the value that should be associated to that key.

    • If the key doesn't exist in the map, add the key/value pair.
    • If the key already exists in the map, update its value to be the value that's being passed here
    This should be fairly similar to the add method in HashSet that you did before.

  • (3 Points) __getitem__: A method that takes one parameter, the key, and which returns the value associated to the key, or which raises a KeyError if that key doesn't exist in the map. This should be fairly similar in structure to the __contains__ method

  • (3 Points) __delitem__: A method that takes one parameter, the key, and which deletes the key/value pair associated to this key, or which raises a KeyError if that key doesn't exist in the map. This should be fairly similar in structure to the remove method in your HashSet class

  • (2 Points) __len__: A method with no parameters (other than self) which returns how many key/value pairs are in the map. As in the HashSet, this should run in constant time

  • (2 Points) keys: A method with no parameters that returns all of the keys in the map as a python list.
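To make the connection between the square-bracket syntax and these special methods concrete, here is a sketch of a hypothetical LoggingMap class. It simply wraps a built-in dictionary and records which special method each piece of syntax triggers; it is not a hash table implementation.

```python
class LoggingMap:
    """Hypothetical wrapper around a built-in dict that records which
    special method each piece of syntax triggers"""
    def __init__(self):
        self._data = {}
        self.calls = []

    def __setitem__(self, key, value):
        self.calls.append("__setitem__")
        self._data[key] = value

    def __getitem__(self, key):
        self.calls.append("__getitem__")
        return self._data[key]  # Built-in dict raises KeyError if missing

    def __delitem__(self, key):
        self.calls.append("__delitem__")
        del self._data[key]     # Also raises KeyError if missing

    def __contains__(self, key):
        self.calls.append("__contains__")
        return key in self._data

m = LoggingMap()
m["a"] = 1        # m.__setitem__("a", 1)
x = m["a"]        # m.__getitem__("a")
found = "a" in m  # m.__contains__("a")
del m["a"]        # m.__delitem__("a")
print(m.calls)
```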

Tips

If you replace the example above with m = HashMap(10) instead of m = {}, you should get exactly the same behavior. Just make sure your keys are getting evenly distributed throughout the buckets


Part 4: Rebalancing (10 Points)

In order for __contains__, __getitem__, __setitem__, and __delitem__ to work in nearly constant time on average, we need to make sure the number of buckets is on the same order as the number of elements stored in the map. For example, if we're storing 100 key/value pairs, we should have around 100 buckets, and if we're storing a million key/value pairs, we should have around a million buckets.

This may seem wasteful since many buckets will be empty, but at worst, we're still within a constant factor of the storage of an array with N elements: we now have N pointers to buckets, plus the N elements and the overhead of the linked nodes. What we gain is that each bucket has only 1 element in it on average. Furthermore, if the hash codes are sufficiently statistically random, then most buckets will not have much more than 1 element in them.

In what follows, you will implement a scheme to ensure that the buckets stay balanced in this way, while not incurring much additional computation on average.

Your Task

Modify the constructor to take two parameters: the initial number of buckets and a "load factor." Then, modify the __setitem__ method so that if the number of elements in the map exceeds load_factor times the number of buckets, the map doubles the number of buckets and re-adds every key/value pair that was in there before to the new buckets. This is the same doubling strategy that a C++ vector, a Java ArrayList, and a python list [] use to keep list accesses and adding to the end of the list amortized constant time.
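To get a feel for how often doubling happens, here is a sketch that tracks only the counters, with no actual hash table involved. The function name simulate_buckets and the choice of 100 distinct keys are assumptions made for illustration.

```python
def simulate_buckets(num_items, initial_buckets=10, load_factor=0.75):
    """Track only the counters (no actual table) as distinct keys are added
    one at a time, doubling whenever the count exceeds
    load_factor * buckets. Returns the final bucket count and the total
    number of key/value pairs that had to be re-added along the way."""
    buckets = initial_buckets
    readded = 0
    for count in range(1, num_items + 1):
        if count > load_factor * buckets:
            buckets *= 2
            readded += count  # Every pair stored so far is rehashed
    return buckets, readded

print(simulate_buckets(100))  # (160, 116)
```

In this simulation, adding 100 keys triggers four doublings (at 8, 16, 31, and 61 elements) and 116 re-adds in total, which is only a constant factor more work than the 100 insertions themselves. That is the sense in which the cost is constant "amortized."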

Tips

To keep your code organized, you should make an internal helper instance method to do the rebalancing, which you call from within __setitem__ at the appropriate time.

Make sure the rebalancing doesn't break your code. If everything is working properly and you start off with an initial capacity of 10 and a load factor of 0.75, then you should see 160 buckets after running the following code.


Part 5: Russian Troll Tweet Wrangling (15 Points)

Now that you have an industrial strength hash map implementation, it's time to stress test it with an application that's super relevant to the 2024 election cycle. I have provided a file tweets.json with all of the "Left Troll" and "Right Troll" tweets from the Russian Troll Tweet Dataset archived by FiveThirtyEight (you can read more about it here). This is 476,215 tweets total. To load it, use this code:

Each tweet is a dictionary. For instance, the first tweet, tweets[0], looks like this:

Your Task

At the bottom of your hash.py file, use the HashMap class you made to print out the top 200 trigrams that the left trolls and the right trolls use. A trigram is a tuple of 3 contiguous words in a sentence. For instance, the sentence "CS 271 is my favorite class" has the trigrams

  • ("CS", "271", "is")
  • ("271", "is", "my")
  • ("is", "my", "favorite")
  • ("my", "favorite", "class")
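A minimal sketch of extracting these trigrams from a list of words follows; the helper name get_trigrams is made up for illustration. Each trigram is built as a tuple, which is immutable and therefore hashable, so it can be used as a key.

```python
def get_trigrams(words):
    """Return all tuples of 3 contiguous words from a list of words"""
    return [tuple(words[i:i+3]) for i in range(len(words) - 2)]

words = "CS 271 is my favorite class".split()
for t in get_trigrams(words):
    print(t)
```

A list with fewer than 3 words simply produces no trigrams.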

Finding all trigrams across all tweets will be a stress test for your implementation, as there are over 2 million unique right trigrams and over 600k unique left trigrams. To get the best results, we'll need to do a basic version of "text normalization" to deal with punctuation and capitalization.

In this example, this gives us
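As a rough sketch of what such normalization could look like, assume we lowercase the text and keep only letters, whitespace, and the # and @ symbols. The exact set of characters to keep is an assumption here, as is the helper name normalize, so counts produced with this sketch may not match the results below exactly.

```python
def normalize(text, keep="#@"):
    """Lowercase text and drop every character that is not a letter,
    whitespace, or in keep; then split the result into words"""
    text = text.lower()
    cleaned = "".join(
        c for c in text if c.isalpha() or c.isspace() or c in keep
    )
    return cleaned.split()

print(normalize("Hello, World! Don't miss #Hashtags from @Someone."))
# ['hello', 'world', 'dont', 'miss', '#hashtags', 'from', '@someone']
```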

Hints

The following algorithm will do this task efficiently:

  1. Create a HashMap object where the key is a tuple of 3 strings for the trigram, and the associated value is the number of times that trigram appears across all tweets

  2. Once this is filled in, grab all of the keys into a list, and create a parallel list of the associated counts

  3. Use np.argsort to find the indices of the trigrams with the highest counts (we will talk more about sorting algorithms in the third unit of the course, but we'll take this as a given for now). For instance,

    Will yield [2, 0, 1]
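Keep in mind that np.argsort sorts in ascending order, so the indices of the largest counts come last. Here is a small sketch with hypothetical parallel lists, where the counts are chosen so that the sorted indices come out as [2, 0, 1]:

```python
import numpy as np

# Hypothetical parallel lists of trigrams and their counts
keys = [("a", "b", "c"), ("b", "c", "d"), ("c", "d", "e")]
counts = [5, 8, 2]

idx = np.argsort(counts)  # Indices from smallest count to largest
print(idx.tolist())       # [2, 0, 1]

# Reverse the order and slice to get the two largest counts first
top = [(keys[i], counts[i]) for i in idx[::-1][:2]]
print(top)
```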

NOTE: If you haven't done rebalancing correctly, this code will take an extremely long time. If you've done it correctly, the left trolls should take under 20 seconds, and the right trolls should take under a minute. If your code is taking a really long time, swap in a dictionary {} for your HashMap object just to test your logic on this part. Then, once it works with the regular dictionary, go and debug your rebalancing.

Results

Content Warning: The results contain highly emotionally charged phrases dealing with issues of race, gender, immigration status, and political violence.

If you've done this correctly, here are the first 200 you should see for type LeftTroll (NOTE: trigrams that have the same number of counts are in an arbitrary order)

1: ('black', 'lives', 'matter') (310)
2: ('the', 'first', 'black') (305)
3: ('one', 'of', 'the') (257)
4: ('bishop', 'ew', 'jackson') (234)
5: ('ew', 'jackson', 'calls') (234)
6: ('podcast', 'bishop', 'ew') (234)
7: ('jackson', 'calls', '#blacklivesmatter') (234)
8: ('audio', 'podcast', 'bishop') (234)
9: ('now', 'audio', 'podcast') (234)
10: ('newsone', 'now', 'audio') (234)
11: ('#blacklivesmatter', 'is', 'movement') (221)
12: ('calls', '#blacklivesmatter', 'is') (221)
13: ('a', 'black', 'man') (216)
14: ('is', 'movement', 'disgraceful') (210)
15: ('martin', 'luther', 'king') (197)
16: ('on', 'this', 'day') (187)
17: ('the', 'united', 'states') (178)
18: ('the', 'truth', 'about') (177)
19: ('we', 'need', 'to') (172)
20: ('to', 'be', 'a') (162)
21: ('this', 'is', 'the') (153)
22: ('killed', 'by', 'police') (151)
23: ('of', 'the', 'black') (150)
24: ('the', 'black', 'panther') (148)
25: ('in', 'the', 'us') (148)
26: ('we', 'have', 'to') (143)
27: ('black', 'panther', 'party') (141)
28: ('this', 'is', 'a') (135)
29: ('this', 'day', 'in') (133)
30: ('this', 'is', 'what') (131)
31: ('i', 'ask', 'you') (129)
32: ('luther', 'king', 'jr') (129)
33: ('is', 'going', 'to') (126)
34: ('is', 'not', 'a') (125)
35: ('became', 'the', 'first') (122)
36: ('in', 'the', 'world') (122)
37: ('to', 'a', '@youtube') (118)
38: ('a', '@youtube', 'playlist') (118)
39: ('i', 'added', 'a') (118)
40: ('video', 'to', 'a') (118)
41: ('a', 'video', 'to') (118)
42: ('added', 'a', 'video') (118)
43: ('the', 'white', 'house') (116)
44: ('there', 'is', 'no') (112)
45: ('first', 'african', 'american') (111)
46: ('the', 'first', 'african') (111)
47: ('you', 'a', 'question') (110)
48: ('ask', 'you', 'a') (110)
49: ('a', 'lot', 'of') (109)
50: ('in', 'american', 'falls') (108)
51: ('you', 'want', 'to') (108)
52: ('was', 'the', 'first') (107)
53: ('in', 'front', 'of') (106)
54: ('a', 'black', 'woman') (105)
55: ('is', 'the', 'first') (103)
56: ('black', 'people', 'are') (103)
57: ('rest', 'in', 'peace') (101)
58: ('of', 'the', 'year') (101)
59: ('first', 'black', 'woman') (100)
60: ('you', 'have', 'to') (99)
61: ('this', 'is', 'why') (97)
62: ('did', 'you', 'know') (97)
63: ('what', 'do', 'you') (96)
64: ('if', 'you', 'are') (96)
65: ('black', 'woman', 'to') (95)
66: ('this', 'is', 'how') (93)
67: ('the', 'first', 'africanamerican') (92)
68: ('the', 'black', 'panthers') (90)
69: ('in', 'the', 'united') (90)
70: ('i', 'want', 'to') (89)
71: ('to', 'have', 'a') (89)
72: ('every', 'download', 'for') (89)
73: ('get', 'for', 'every') (89)
74: ('shot', 'and', 'killed') (89)
75: ('for', 'every', 'download') (89)
76: ('#blackowned', '#buyblack', '#blacktwitter') (89)
77: ('download', 'for', 'every') (89)
78: ('for', 'every', 'biz') (89)
79: ('the', '#blacklivesmatter', 'movement') (86)
80: ('#blacklivesmatter', '#justiceformariowoods', '#mariowoods') (85)
81: ('#justiceformariowoods', '#mariowoods', '#sanfrancisco') (85)
82: ('#mariowoods', '#sanfrancisco', '#sanfranciscoshooting') (85)
83: ('#sanfrancisco', '#sanfranciscoshooting', '#sfpd') (84)
84: ('for', 'no', 'reason') (84)
85: ('repeatedly', 'interrupted', 'by') (84)
86: ('dont', 'want', 'to') (83)
87: ('to', 'go', 'to') (83)
88: ('anniversary', 'of', 'the') (83)
89: ('th', 'anniversary', 'of') (83)
90: ('if', 'you', 'want') (81)
91: ('in', 'police', 'custody') (80)
92: ('the', 'fact', 'that') (80)
93: ('may', 'i', 'ask') (80)
94: ('share', 'this', 'flyer') (79)
95: ('we', 'will', 'never') (78)
96: ('#breakingnews', '#blacklivesmatter', '#justiceformariowoods') (77)
97: ('thank', 'you', 'for') (77)
98: ('the', 'super', 'bowl') (76)
99: ('of', 'black', 'people') (74)
100: ('the', 'th', 'anniversary') (74)
101: ('will', 'never', 'forget') (74)
102: ('interrupted', 'by', 'protesters') (74)
103: ('in', 'case', 'you') (73)
104: ('#sanfranciscoshooting', '#sfpd', '#bayview') (73)
105: ('the', 'first', 'time') (73)
106: ('believe', 'in', 'love') (71)
107: ('for', 'black', 'people') (71)
108: ('word', 'of', 'truth') (71)
109: ('a', 'word', 'of') (71)
110: ('black', 'women', 'who') (70)
111: ('the', 'right', 'to') (70)
112: ('you', 'have', 'a') (70)
113: ('you', 'know', 'that') (70)
114: ('of', 'the', 'most') (69)
115: ('#phosphorusdisaster', 'in', 'american') (68)
116: ('i', 'dont', 'know') (68)
117: ('in', 'the', 's') (68)
118: ('want', 'to', 'be') (68)
119: ('if', 'you', 'dont') (68)
120: ('to', 'be', 'the') (67)
121: ('black', 'women', 'are') (67)
122: ('is', 'the', 'only') (67)
123: ('dont', 'have', 'to') (67)
124: ('do', 'you', 'think') (67)
125: ('all', 'lives', 'matter') (66)
126: ('who', 'died', 'in') (65)
127: ('is', 'the', 'most') (65)
128: ('shot', 'by', 'police') (65)
129: ('#servicessale', 'download', 'app') (65)
130: ('app', 'for', 'details') (65)
131: ('#blacktwitter', '#servicessale', 'download') (65)
132: ('download', 'app', 'for') (65)
133: ('click', 'here', 'for') (64)
134: ('there', 'is', 'a') (64)
135: ('we', 'are', 'not') (63)
136: ('trump', 'repeatedly', 'interrupted') (63)
137: ('flyer', 'and', 'join') (63)
138: ('is', 'like', 'a') (63)
139: ('years', 'in', 'prison') (63)
140: ('for', 'the', 'first') (63)
141: ('traumatic', 'stress', 'disorder') (63)
142: ('protesters', 'shouting', '#blacklivesmatter') (63)
143: ('by', 'protesters', 'shouting') (63)
144: ('how', 'racism', 'causes') (62)
145: ('racism', 'causes', 'post') (62)
146: ('part', 'of', 'the') (62)
147: ('in', 'order', 'to') (62)
148: ('died', 'in', 'police') (62)
149: ('do', 'you', 'have') (62)
150: ('#health', '#racism', '#wellness') (62)
151: ('stress', 'disorder', 'ptsd') (62)
152: ('post', 'traumatic', 'stress') (62)
153: ('its', 'time', 'to') (62)
154: ('#racism', '#wellness', '#blacklivesmatter') (62)
155: ('causes', 'post', 'traumatic') (62)
156: ('he', 'is', 'a') (61)
157: ('is', 'one', 'of') (61)
158: ('out', 'of', 'the') (61)
159: ('my', 'name', 'is') (61)
160: ('celebrate', 'the', 'th') (61)
161: ('in', 'the', 'back') (61)
162: ('people', 'like', 'this') (61)
163: ('join', 'us', 'to') (61)
164: ('the', 'only', 'way') (61)
165: ('and', 'join', 'our') (61)
166: ('brothers', 'and', 'sisters') (61)
167: ('i', 'cant', 'believe') (60)
168: ('you', 'need', 'to') (60)
169: ('is', 'the', 'best') (60)
170: ('of', 'police', 'brutality') (60)
171: ('get', 'away', 'with') (60)
172: ('#blacklivesmatter', '#blacklivesmatter', '#blacklivesmatter') (60)
173: ('black', 'history', 'month') (60)
174: ('to', 'celebrate', 'the') (60)
175: ('judge', 'rips', 'media') (59)
176: ('unarmed', 'black', 'man') (59)
177: ('a', 'time', 'to') (59)
178: ('a', 'part', 'of') (59)
179: ('can', 'i', 'ask') (58)
180: ('this', 'is', 'so') (58)
181: ('the', 'age', 'of') (58)
182: ('the', 'halftime', 'show') (58)
183: ('to', 'become', 'a') (58)
184: ('have', 'to', 'be') (58)
185: ('trump', 'is', 'a') (58)
186: ('were', 'killed', 'by') (58)
187: ('go', 'back', 'to') (57)
188: ('more', 'likely', 'to') (57)
189: ('name', 'is', 'ghani') (57)
190: ('us', 'to', 'celebrate') (57)
191: ('every', 'time', 'i') (57)
192: ('the', 'death', 'of') (57)
193: ('fatal', 'shooting', 'of') (56)
194: ('happy', 'birthday', 'to') (56)
195: ('black', 'on', 'black') (56)
196: ('people', 'of', 'color') (56)
197: ('of', 'the', 'united') (55)
198: ('women', 'got', 'natural') (55)
199: ('you', 'dont', 'have') (55)
200: ('these', 'women', 'got') (55)

And here are the top 200 you should see for type RightTroll

1: ('cnn', 'is', '#fakenews') (3299)
2: ('#fakenews', 'cnn', 'is') (2762)
3: ('is', '#fakenews', 'cnn') (2759)
4: ('our', 'patriot', 'army') (1561)
5: ('https//tco/mfbjijyl', 'rewind', '/') (1489)
6: ('the', 'white', 'house') (1444)
7: ('enlist', 'in', 'the') (1361)
8: ('patriot', 'army', 'at') (1352)
9: ('the', '#usfa', 'at') (1256)
10: ('rtamerica', 'to', '#maga') (1208)
11: ('enlist', 'in', 'our') (1208)
12: ('rt', 'rtamerica', 'to') (1208)
13: ('retweet', 'rt', 'rtamerica') (1196)
14: ('in', 'the', '#usfa') (1187)
15: ('in', 'our', 'patriot') (1132)
16: ('enlist', 'with', 'us') (920)
17: ('with', 'us', 'at') (906)
18: ('via', 'the', 'foxnews') (889)
19: ('the', 'foxnews', 'app') (856)
20: ('this', 'is', 'the') (759)
21: ('we', 'need', 'to') (680)
22: ('supporters', 'react', 'to') (650)
23: ('@cnn', '@cnni', '@cnnpolitics') (648)
24: ('@wolfblitzer', '@jaketapper', '@theleadcnn') (648)
25: ('@cnni', '@cnnpolitics', '@cnnsitroom') (648)
26: ('@theleadcnn', '@brianstelter', '@ananavarro') (648)
27: ('@brianstelter', '@ananavarro', '@donlemon') (648)
28: ('@cnnpolitics', '@cnnsitroom', '@wolfblitzer') (648)
29: ('@cnnsitroom', '@wolfblitzer', '@jaketapper') (648)
30: ('@vanjones', '@andersoncooper', '@ac') (648)
31: ('@ananavarro', '@donlemon', '@vanjones') (648)
32: ('@donlemon', '@vanjones', '@andersoncooper') (648)
33: ('@jaketapper', '@theleadcnn', '@brianstelter') (648)
34: ('the', 'truth', 'about') (642)
35: ('trump', 'supporters', 'react') (635)
36: ('rt', 'if', 'you') (631)
37: ('black', 'lives', 'matter') (629)
38: ('one', 'of', 'the') (607)
39: ('you', 'wont', 'believe') (583)
40: ('kim', 'jong', 'un') (571)
41: ('the', 'united', 'states') (570)
42: ('there', 'is', 'no') (549)
43: ('the', 'american', 'people') (537)
44: ('is', '#fakenews', '#fakenews') (533)
45: ('fire', 'and', 'fury') (526)
46: ('to', 'be', 'a') (517)
47: ('you', 'need', 'to') (509)
48: ('is', 'going', 'to') (507)
49: ('its', 'time', 'to') (494)
50: ('president', 'trump', 'just') (493)
51: ('a', 'white', 'supremacist') (487)
52: ('america', 'great', 'again') (485)
53: ('a', 'lot', 'of') (483)
54: ('this', 'is', 'what') (481)
55: ('you', 'are', 'a') (466)
56: ('the', 'end', 'of') (464)
57: ('you', 'know', 'that') (463)
58: ('bostick', 'is', '#fakenews') (456)
59: ('dani', 'bostick', 'is') (456)
60: ('make', 'america', 'great') (453)
61: ('this', 'is', 'a') (449)
62: ('#trumptrain', '#maga', '#potus') (449)
63: ('look', 'what', 'he') (440)
64: ('stand', 'up', 'for') (439)
65: ('#fakenews', '#fakenews', '#maga') (439)
66: ('what', 'he', 'said') (438)
67: ('san', 'juan', 'mayor') (433)
68: ('retweet', 'if', 'you') (433)
69: ('#usfa', 'at', 'https//tco/mjnlcxvf') (431)
70: ('do', 'you', 'think') (429)
71: ('fake', 'news', 'media') (422)
72: ('black', 'trump', 'supporter') (416)
73: ('we', 'do', 'on') (415)
74: ('remember', 'we', 'do') (415)
75: ('diamond', 'and', 'silk') (410)
76: ('out', 'of', 'the') (408)
77: ('in', 'the', 'world') (407)
78: ('you', 'are', 'not') (407)
79: ('you', 'want', 'to') (406)
80: ('on', 'north', 'korea') (405)
81: ('breaking', 'north', 'korea') (403)
82: ('thank', 'you', 'for') (391)
83: ('need', 'to', 'know') (386)
84: ('to', 'stand', 'for') (385)
85: ('for', 'trumps', 'assassination') (383)
86: ('who', 'called', 'for') (381)
87: ('called', 'for', 'trumps') (379)
88: ('if', 'you', 'think') (376)
89: ('to', 'impeach', 'trump') (372)
90: ('in', 'the', 'us') (371)
91: ('for', 'the', 'anthem') (371)
92: ('dont', 'want', 'to') (367)
93: ('army', 'at', 'https//tco/mjnlcxvf') (367)
94: ('i', 'want', 'to') (365)
95: ('this', 'is', 'why') (364)
96: ('well', 'well', 'well') (363)
97: ('is', 'not', 'a') (363)
98: ('is', 'going', 'viral') (359)
99: ('we', 'have', 'to') (357)
100: ('look', 'who', 'is') (355)
101: ('president', 'trump', 'is') (354)
102: ('traitor', 'john', 'mccain') (343)
103: ('@andersoncooper', '@ac', '@jimacosta') (342)
104: ('@ac', '@jimacosta', 'cnn') (342)
105: ('@jimacosta', 'cnn', 'is') (342)
106: ('is', '#fakenews', 'dani') (342)
107: ('#fakenews', 'dani', 'bostick') (342)
108: ('with', 'north', 'korea') (341)
109: ('remember', 'this', 'on') (337)
110: ('#draintheswamp', '#trumptrain', '#maga') (333)
111: ('needs', 'to', 'be') (333)
112: ('if', 'you', 'dont') (332)
113: ('this', 'is', 'how') (331)
114: ('refugees', 'are', 'terrorists') (330)
115: ('take', 'a', 'knee') (327)
116: ('if', 'you', 'want') (327)
117: ('if', 'you', 'agree') (323)
118: ('#top', 'rt', 'terrebehlog') (319)
119: ('you', 'have', 'to') (313)
120: ('not', 'going', 'to') (312)
121: ('the', 'charlottesville', 'tragedy') (308)
122: ('do', 'on', '//') (306)
123: ('to', 'do', 'with') (301)
124: ('there', 'is', 'a') (297)
125: ('voted', 'for', 'trump') (297)
126: ('trump', 'condemns', 'all') (296)
127: ('we', 'have', 'a') (294)
128: ('what', 'do', 'you') (293)
129: ('us', 'at', 'https//tco/mjnlcxvf') (293)
130: ('get', 'rid', 'of') (291)
131: ('donald', 'trump', 'is') (291)
132: ('new', 'poll', 'shows') (290)
133: ('join', 'our', 'patriot') (287)
134: ('new', 'york', 'times') (286)
135: ('blaming', 'both', 'sides') (286)
136: ('the', 'deep', 'state') (286)
137: ('and', 'its', 'bad') (285)
138: ('i', 'voted', 'for') (284)
139: ('senator', 'who', 'called') (283)
140: ('want', 'you', 'to') (282)
141: ('stand', 'for', 'the') (282)
142: ('https//tco/mfbjijhbv', 'rewind', '/') (280)
143: ('will', 'make', 'you') (280)
144: ('are', 'trying', 'to') (279)
145: ('in', 'the', 'white') (278)
146: ('trump', 'is', 'a') (272)
147: ('we', 'are', 'not') (272)
148: ('to', 'attack', 'trump') (271)
149: ('our', 'patriots', 'at') (269)
150: ('trump', 'effect', 'us') (269)
151: ('the', 'left', 'is') (269)
152: ('we', 'the', 'people') (267)
153: ('are', 'going', 'to') (265)
154: ('traitor', 'jeff', 'flake') (265)
155: ('are', 'not', 'a') (263)
156: ('breaking', 'president', 'trump') (261)
157: ('you', 'are', 'an') (257)
158: ('americans', 'agree', 'with') (257)
159: ('is', 'trying', 'to') (256)
160: ('is', 'the', 'most') (254)
161: ('this', 'on', '//') (252)
162: ('your', 'enlistment', 'at') (251)
163: ('up', 'for', 'america') (251)
164: ('need', 'to', 'be') (249)
165: ('look', 'how', 'much') (249)
166: ('poll', 'of', 'gop') (248)
167: ('breaking', 'look', 'who') (247)
168: ('marist', 'poll', 'of') (247)
169: ('if', 'you', 'are') (245)
170: ('it', 'was', 'a') (244)
171: ('agree', 'with', 'trump') (244)
172: ('at', 'https//tco/mjnlcxvf', 'stand') (242)
173: ('what', 'happens', 'when') (241)
174: ('us', 'freedom', 'army') (241)
175: ('i', 'dont', 'think') (240)
176: ('to', 'north', 'korea') (237)
177: ('are', 'a', 'racist') (237)
178: ('the', 'right', 'to') (236)
179: ('#trumptrain', '#maga', 'potus') (235)
180: ('senator', 'you', 'are') (235)
181: ('trump', 'is', 'racist') (234)
182: ('#pjnet', '#tcot', '#ccot') (234)
183: ('#gopdebate', 'you', 'know') (233)
184: ('a', 'native', 'american') (233)
185: ('bad', 'news', 'for') (232)
186: ('religion', 'of', 'peace') (230)
187: ('pocahontas', 'you', 'are') (229)
188: ('racist', 'fraud', 'you') (229)
189: ('an', 'awful', 'senator') (229)
190: ('hey', 'pocahontas', 'you') (229)
191: ('fraud', 'you', 'are') (229)
192: ('awful', 'senator', 'you') (229)
193: ('a', 'racist', 'fraud') (229)
194: ('are', 'an', 'awful') (229)
195: ('not', 'a', 'native') (229)
196: ('and', 'fury', 'comments') (228)
197: ('in', 'front', 'of') (227)
198: ('is', 'the', 'best') (225)
199: ('called', 'trump', 'a') (224)
200: ('i', 'dont', 'want') (224)

P.S.

Hopefully now it makes sense why they're called "hash tags"!

Also, if you're curious, click here to see more advanced preprocessing you can do to normalize the text.