{"id":284,"date":"2020-12-05T16:20:12","date_gmt":"2020-12-05T16:20:12","guid":{"rendered":"http:\/\/www.sugata.in\/?p=284"},"modified":"2020-12-05T16:26:37","modified_gmt":"2020-12-05T16:26:37","slug":"wordclouds","status":"publish","type":"post","link":"http:\/\/www.sugata.in\/index.php\/2020\/12\/05\/wordclouds\/","title":{"rendered":"Wordclouds"},"content":{"rendered":"\n<p>As part of my adventures in natural language processing and learning Python, I wanted to learn how to make word clouds. We see these things all the time in PowerPoint presentations.<\/p>\n\n\n\n<p>They look fairly cool and the technology used to create them seems fairly straightforward. The computer counts the number of times a word appears in some text. Words that appear more frequently are drawn bigger (ignoring common words like &#8220;the,&#8221; &#8220;of,&#8221; &#8220;a&#8221; and such), while words that appear less frequently are still shown but take up less space.<\/p>\n\n\n\n<p>This was a bit before the 2020 election. I wanted to see whether different news sources were covering different topics, and to be able to visualize those differences easily.<\/p>\n\n\n\n<p>I wrote some code (some repurposed from https:\/\/www.datacamp.com\/community\/tutorials\/wordcloud-python ) to scrape the RSS feeds of various news sources and generate word clouds. These were the clouds I got from CNN and BBC. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"1024\" src=\"http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/cnnwc-1024x1024.png\" alt=\"\" class=\"wp-image-285\" srcset=\"http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/cnnwc-1024x1024.png 1024w, http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/cnnwc-300x300.png 300w, http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/cnnwc-150x150.png 150w, http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/cnnwc-768x768.png 768w, http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/cnnwc.png 1426w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>CNN word cloud (Oct 3rd 2020)<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"1024\" src=\"http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/bbcwc-1024x1024.png\" alt=\"\" class=\"wp-image-286\" srcset=\"http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/bbcwc-1024x1024.png 1024w, http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/bbcwc-300x300.png 300w, http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/bbcwc-150x150.png 150w, http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/bbcwc-768x768.png 768w, http:\/\/www.sugata.in\/wp\/wp-content\/uploads\/2020\/12\/bbcwc.png 1408w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>BBC word cloud (Oct 3rd 2020)<\/figcaption><\/figure>\n\n\n\n<p>And this is the code I used to generate them. 
It&#8217;s currently set up for the CNN URLs, but you can put in whatever RSS feed URLs you like and it should work (you will need to re-indent it to get it working; unfortunately, the indents don&#8217;t paste properly).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\n#import libraries\nimport requests\nfrom bs4 import BeautifulSoup\n#import pandas to create dataframe and CSV\nimport pandas as pd\nimport time\nfrom wordcloud import WordCloud, STOPWORDS \nimport matplotlib.pyplot as plt \n\n\n#enter URLs\ncnnurls = &#91;\"http:\/\/rss.cnn.com\/rss\/cnn_topstories.rss\",\n        \"http:\/\/rss.cnn.com\/rss\/cnn_world.rss\",\n        \"http:\/\/rss.cnn.com\/rss\/cnn_us.rss\",\n        \"http:\/\/rss.cnn.com\/rss\/money_latest.rss\",\n        \"http:\/\/rss.cnn.com\/rss\/cnn_allpolitics.rss\",\n        \"http:\/\/rss.cnn.com\/rss\/cnn_tech.rss\",\n     #   \"http:\/\/rss.cnn.com\/rss\/cnn_health.rss\",\n     #   \"http:\/\/rss.cnn.com\/rss\/cnn_showbiz.rss\",\n     #   \"http:\/\/rss.cnn.com\/rss\/cnn_travel.rss\",\n        \"http:\/\/rss.cnn.com\/rss\/money_news_companies.rss\",\n        \"http:\/\/rss.cnn.com\/rss\/money_news_international.rss\",\n        \"http:\/\/rss.cnn.com\/rss\/money_news_economy.rss\"\n       ]\nbbcurls = &#91;\"http:\/\/feeds.bbci.co.uk\/news\/rss.xml\",\n           \"http:\/\/feeds.bbci.co.uk\/news\/world\/rss.xml\",\n           \"http:\/\/feeds.bbci.co.uk\/news\/uk\/rss.xml\",\n           \"http:\/\/feeds.bbci.co.uk\/news\/business\/rss.xml\",\n           \"http:\/\/feeds.bbci.co.uk\/news\/politics\/rss.xml\",\n          # \"http:\/\/feeds.bbci.co.uk\/news\/health\/rss.xml\",\n           \"http:\/\/feeds.bbci.co.uk\/news\/education\/rss.xml\",\n           \"http:\/\/feeds.bbci.co.uk\/news\/technology\/rss.xml\",\n           \"http:\/\/feeds.bbci.co.uk\/news\/entertainment_and_arts\/rss.xml\"\n          ]\nnews_items = &#91;]\nfor url in cnnurls:\n    resp = requests.get(url)\n\n    soup = BeautifulSoup(resp.content, features=\"xml\")\n\n  
  items = soup.find_all('item')\n\n    #print(len(items))\n    \n    #scraping RSS tags such as title, description, link and publication date\n    for item in items:\n        news_item = {}\n        news_item&#91;'title'] = item.title.text\n        news_item&#91;'description'] = item.description.text\n       # news_item&#91;'link'] = item.link.text\n       # news_item&#91;'pubDate'] = item.pubDate.text\n\n        news_items.append(news_item)\n    time.sleep(1)\n\ndf = pd.DataFrame(news_items,columns=&#91;'title','description'])\ndf.to_csv('CNNdata1.csv',index=False, encoding = 'utf-8')\n\ndf = pd.read_csv('CNNdata1.csv',encoding = 'utf-8') \n  \ncomment_words = '' \nstopwords = set(STOPWORDS) \n  \n# iterate through the titles read back from the csv file \nfor val in df.title: \n      \n    # typecast each val to string \n    val = str(val) \n  \n    # split the value \n    tokens = val.split() \n      \n    # convert each token to lowercase \n    for i in range(len(tokens)): \n        tokens&#91;i] = tokens&#91;i].lower() \n      \n    comment_words += \" \".join(tokens)+\" \"\n  \nwordcloud = WordCloud(width = 800, height = 800, \n                background_color ='white', \n                stopwords = stopwords, \n                min_font_size = 10).generate(comment_words) \n  \n# plot the WordCloud image                        \nplt.figure(figsize = (8, 8), facecolor = None) \nplt.imshow(wordcloud) \nplt.axis(\"off\") \nplt.tight_layout(pad = 0) \n  \nplt.show()<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>As part of my adventures in natural language processing and learning Python, I wanted to learn how to make word clouds. We see these things all the time in PowerPoint presentations. They look fairly cool and the technology used to create them seems fairly straightforward. 
The computer counts the number of times a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[9,10],"tags":[],"_links":{"self":[{"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/posts\/284"}],"collection":[{"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/comments?post=284"}],"version-history":[{"count":2,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/posts\/284\/revisions"}],"predecessor-version":[{"id":288,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/posts\/284\/revisions\/288"}],"wp:attachment":[{"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/media?parent=284"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/categories?post=284"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/tags?post=284"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}