Explain your work. All work must be your own.
For code questions, you do not need to give working code or use exact forms of functions if you cannot remember. You can use "pseudocode" involving standard programming structures and functions.
Zipf's law states that the $n$th most common word in a corpus is $1/n$ as common as the most common word. For example, the 10th most common word is used 1/10 as often as the most common word (in English, probably 'the').
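In symbols, writing $f(n)$ for the number of times the $n$th most common word occurs (this notation is assumed here for illustration and is not part of the original statement), the law reads:

$$f(n) = \frac{f(1)}{n}, \qquad \text{so for example } f(10) = \frac{f(1)}{10}.$$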
Given a string variable containing text for a long corpus, describe how you would determine the frequency of the words.
In your answer, only use low-level Python (no packages).
For example the first few characters of the string would be something like: "This enhances our commitment to open-source collaboration while providing additional protections for contributors and users alike. It provides a collection of working systems with different complexities."...
Write your code such that it operates on an input string variable mystring.
# giving python code here.
# pseudocode would be similar but with true python function names and loop syntax replaced
# with some roughly similar creation
mystring = "this This THIS this enhances our commitment to open-source collaboration while providing additional protections for contributors and users alike. It provides a collection of working systems with different complexities."
word_freqs = {}
words = mystring.split(' ')
for word in words:
    if word not in word_freqs:
        word_freqs[word] = 1  # initialize counter
    else:
        word_freqs[word] = word_freqs[word] + 1  # increment counter
print(word_freqs)  # not many repeats in this case. note punctuation is included in words, and case is not normalized
{'this': 2, 'This': 1, 'THIS': 1, 'enhances': 1, 'our': 1, 'commitment': 1, 'to': 1, 'open-source': 1, 'collaboration': 1, 'while': 1, 'providing': 1, 'additional': 1, 'protections': 1, 'for': 1, 'contributors': 1, 'and': 1, 'users': 1, 'alike.': 1, 'It': 1, 'provides': 1, 'a': 1, 'collection': 1, 'of': 1, 'working': 1, 'systems': 1, 'with': 1, 'different': 1, 'complexities.': 1}
# handle punctuation and case
word_freqs = {}
words = mystring.split()  # split on any run of whitespace, so no empty strings appear
for word in words:
    word = word.lower()  # convert all to lowercase
    if word[-1] in {'.', ',', '?', '!'}:
        word = word[:-1]  # chop off one trailing punctuation mark if found
    if word not in word_freqs:
        word_freqs[word] = 1
    else:
        word_freqs[word] = word_freqs[word] + 1
print(word_freqs)  # better (though won't be perfect)
{'this': 4, 'enhances': 1, 'our': 1, 'commitment': 1, 'to': 1, 'open-source': 1, 'collaboration': 1, 'while': 1, 'providing': 1, 'additional': 1, 'protections': 1, 'for': 1, 'contributors': 1, 'and': 1, 'users': 1, 'alike': 1, 'it': 1, 'provides': 1, 'a': 1, 'collection': 1, 'of': 1, 'working': 1, 'systems': 1, 'with': 1, 'different': 1, 'complexities': 1}
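To compare against Zipf's law, the counts still need to be ranked from most to least common. The following is a minimal sketch of one way to do that, not a definitive method: the helper name `rank_words` is hypothetical, and the counts are made up for illustration rather than taken from a real corpus. `sorted` is a built-in, so no packages are needed.

```python
def rank_words(word_freqs):
    """Return (word, count) pairs sorted from most to least common."""
    return sorted(word_freqs.items(), key=lambda pair: pair[1], reverse=True)

# Made-up counts for illustration only (not from a real corpus).
word_freqs = {'and': 4, 'the': 12, 'to': 3, 'of': 6}

ranked = rank_words(word_freqs)
top_count = ranked[0][1]  # count of the most common word

for rank, (word, count) in enumerate(ranked, start=1):
    predicted = top_count / rank  # Zipf's law: top count divided by rank
    print(rank, word, count, predicted)
```

On a real corpus the observed and predicted counts would only agree roughly, and the comparison is only meaningful when the text is long enough for the common words to repeat many times.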
Suppose you want to find all decimal numbers in some text, with no errors or exceptions. E.g., numbers of the form 123.456. Describe how regex can achieve this in as much detail as you can, and note the particular issues that come up.
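Make a pattern which finds: one or more digits, then a literal period (escaped, because an unescaped "." matches any character), then one or more digits.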
Possible issue: numbers may lack decimal places, e.g. "5.00" may be written as simply "5". So we should check for those also.
# python example
import re
mystring = "This 123.456 ... 78 ... 9."
pattern = r"\d+\.\d+|\d+"  # raw string avoids doubling the backslashes; try the decimal form first, then plain integers
print(re.findall(pattern, mystring))
['123.456', '78', '9']
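Further issues come up with signed numbers and with numbers written without a leading digit, such as ".75". The sketch below is one possible extension of the pattern above, under the assumption that a leading "+" or "-" should count as part of the number; the test string is made up for illustration.

```python
import re

# Assumed extension: optional sign, and numbers such as ".75"
# that have no digits before the decimal point.
pattern = r"[-+]?(?:\d+\.\d+|\.\d+|\d+)"

text = "Rates: -1.5, +20, .75 and 3."
print(re.findall(pattern, text))  # ['-1.5', '+20', '.75', '3']
```

The sign handling has a cost: in a range such as "10-20" the hyphen is read as a minus sign, so the matches are '10' and '-20'. Whether that is acceptable depends on the text being searched.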