Elements of Machine Learning V: spaCy and Natural Language Processing
Over the last four weeks I’ve been running through an exercise to demonstrate how machine learning techniques can be applied to a number of specific problems, from fitting functions to a set of data points to selecting the best classifier for a given dataset. The last couple of posts have focussed specifically on word vectors. The linking theme across the various discussions and code samples provided has been the use of Python and specifically the scikit-learn machine learning library.
This week’s post builds on last week’s one where I focussed on basic sentiment analysis using scikit-learn but also introduced the spaCy natural language processing framework. I indicated that it could be used in conjunction with scikit-learn to build a more sophisticated approach. This code sample is a flavour of what I did to integrate the two. In it I load up the same IMDB database csv file built last week to a pandas dataframe, process the data with spaCy to check that side is all present and correct and then integrate it into a scikit-learn pipeline. The tokenizer here is built using spaCy and the text transformer using simple Python replace calls. To run this code I set up an AWS 2GB t2.small instance running 64-bit Ubuntu 14.04 with a Python 3.4 virtualenv with a complete scikit-learn stack. It helped convince me that the ability to easily switch on/off cloud-based machine learning environments for rapid prototyping and investigations will be an important factor in popularising these approaches:
    import pandas as pd
    import spacy
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # tokenizer, TextTransformer, maybeLoadReviewsToCsv and
    # splitTrainingAndTestData are helpers that are not shown in this listing.

    def bestClassifierModel(X_train, y_train):
        # Bag-of-words features built from the spaCy-based tokenizer
        vectorizer = CountVectorizer(tokenizer=tokenizer, ngram_range=(1,1))
        clf = Pipeline([('xform', TextTransformer()),
                        ('vect', vectorizer),
                        ('clf', LogisticRegression(random_state=0))])
        params = {'vect__stop_words':None, 'vect__tokenizer':tokenizer,
                  'vect__ngram_range':(1,1), 'clf__C':10, 'clf__penalty':'l2'}
        clf.set_params(**params)
        clf.fit(X_train, y_train)
        return clf, vectorizer

    # ---------------------------------------------
    maybeLoadReviewsToCsv('movie_data.csv')
    df = pd.read_csv('movie_data.csv')

    # ----------- Try out spaCy -------------------
    N = 10
    print("Checking first %d reviews:%s" % (N, df.head(N)))
    # English parser, using the API of the spaCy release current at the time
    parser = spacy.en.English()
    print("Sentences:")
    # Join the first N reviews into one corpus and parse it once
    corpus = '.'.join(df.review[:N])
    p = parser(corpus)
    print(parser.vocab.vectors_length)
    for i, sent in enumerate(p.sents):
        print("---- sentence %d ----" % i)
        print(sent)
        for j, token in enumerate(sent):
            print(token.orth_, token.pos_, token.lemma_)

    # ----------- Try out classifier --------------
    ratio = 0.75
    X_train, X_test, y_train, y_test = splitTrainingAndTestData(df, ratio)
    print("Fitting...")
    clf, vectorizer = bestClassifierModel(X_train, y_train)
    testscore = clf.score(X_test, y_test)
    print("----\nTest accuracy: %.3f\n----" % testscore)
It took approx 12 minutes to run this code and the test accuracy I achieved using this approach was surprisingly low (well under 90%). It’s possible I’m not doing something right. If I get the chance I’ll try to look into why a ‘simple’ unigram classifier apparently performs better on classifying IMDB review sentiment than spaCy.
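The `tokenizer` and `TextTransformer` helpers used in the pipeline are not shown in the listing. The following is only a minimal sketch of what they might look like, not the code I actually ran: the specific replacements (stripping the `<br />` tags that litter IMDB reviews, plus newlines and tabs), the `make_tokenizer` factory and the `fake_nlp` stand-in are all assumptions for illustration. The tokenizer factory accepts any spaCy-style callable that returns tokens carrying `lemma_`, `is_punct` and `is_space` attributes, so the sketch can be exercised without spaCy installed.

```python
from types import SimpleNamespace


class TextTransformer:
    """Pipeline step that cleans raw review text with plain str.replace calls.

    The replacement pairs below are assumptions chosen for illustration.
    """

    REPLACEMENTS = (("<br />", " "), ("\n", " "), ("\t", " "))

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # Apply the same clean-up to every document in the input sequence.
        return [self._clean(text) for text in X]

    def _clean(self, text):
        for old, new in self.REPLACEMENTS:
            text = text.replace(old, new)
        return text.strip()


def make_tokenizer(nlp):
    """Build a tokenizer callable from a spaCy-style parser `nlp`."""
    def tokenizer(text):
        # Keep lower-cased lemmas, dropping punctuation and whitespace tokens.
        return [tok.lemma_.lower() for tok in nlp(text)
                if not (tok.is_punct or tok.is_space)]
    return tokenizer


def fake_nlp(text):
    """Hypothetical stand-in for a spaCy parser: whitespace split, no real lemmas."""
    return [SimpleNamespace(lemma_=word,
                            is_punct=all(ch in ".,!?" for ch in word),
                            is_space=False)
            for word in text.split()]


if __name__ == "__main__":
    cleaned = TextTransformer().fit_transform_demo = TextTransformer().transform(
        ["Great film.<br /><br />Loved it!\n"])
    print(cleaned)
    print(make_tokenizer(fake_nlp)("Loved it !"))
```

In real use the spaCy parser would be passed in place of `fake_nlp`, and the resulting function handed to `CountVectorizer(tokenizer=...)`. Subclassing scikit-learn's `BaseEstimator` and `TransformerMixin` would be the more conventional way to write the transformer; the plain class here just provides the `fit` and `transform` methods a pipeline step is expected to have.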
Artificial Intelligence and Robots
- ZDNet outline the three reasons why AI is taking off right now: i) big data, ii) inexpensive commodity GPU hardware and iii) public cloud and also present a 10 point plan to allow your business to take practical advantage of the coming ‘Enterprise AI’ wave:
1. Embrace the idea that machine intelligence will matter to your organization.
2. Identify which forms could be most important to your firm.
3. Check out relevant start-ups and developments.
4. Understand which parts of your firm could be safely run by algorithms.
5. Determine which internal and external data sets have the most potential.
6. Assess the extent to which your firm’s key professional expertise can be automated.
7. Try out deep learning, neural computing and other technologies.
8. Map the relevant AI services and technologies to your firm’s value chain.
9. Develop machine intelligence experts in your organisation.
10. Factor AI advances into your strategic planning.
- Google Director of Research Peter Norvig’s excellent recent EmTech keynote on the state of the art in AI conveys a similar message of serious change looming in the near distance for any remaining businesses that assume it doesn’t affect them:
“Every industry will feel the impact of both established and emerging artificial intelligence techniques.”
- Google are well-placed to profit from these developments but it hasn’t all been plain sailing to date even so. Their Boston Dynamics acquisition in particular seems to have been something of a cultural mismatch from the outset, with conflicting visions for consumer robotics between ‘us and them’ according to this illuminating Techinsider exposé. It appears they will now be parting ways, with Toyota Research rumoured to be first in line to acquire the Atlas baton with all that entails:
- Post IO, it is clear Google’s focus has shifted to TensorFlow and “AI as a platform”. This includes the provision of specialised hardware support in the form of the TPU, which has been well received by industry analysts and puts them in direct competition with NVidia.
- Some companies are already taking dramatic steps to accommodate AI advances with seismic impact on their human workforce. The news that Foxconn has ‘replaced 60,000 factory workers with robots‘ is a good example and was widely covered last week.
- Apple too are being drawn into the fray if this week’s reports that they intend to open up Siri to third party developers and develop an “AI speaker” competitor to Amazon’s Echo are to be believed. Last week’s blog highlighted some pointed scrutiny Apple have garnered since IO over their specific response to Google’s multiple-pronged assault on the AI and Machine Learning high ground. According to The Information which first reported the development:
the Web-connected Apple speaker in development would allow people to turn on/off or change settings for home appliances and other devices that support Apple’s “HomeKit” software, says the person with direct knowledge of the project. Such items include lights, sensors, thermostats, plugs and locks.
- This rock sorting robot is admittedly less centre stage in application than a Siri speaker, though far more interesting to watch:
- Doubtless driven by a need to keep up with DARPA, Russia too has a humanoid military robot “Iron Man” proposition to show off, which Vice are calling ‘Ivan the Terminator’, with a suitably bleak realpolitik behind his reason to be:
There are over 80 different countries that have military robots. They aren’t doing this because they think it’s cool. They are doing it because they think it gives them some kind of advantage in either a current conflict or a future conflict.
https://vimeo.com/166726214
- Apparently Google may try to ‘shame’ OEMs into keeping up to date with Android releases. A massively fragmented OS update picture is a major weakness for the platform in relation to Apple’s iOS as demonstrated by the stark version landscape difference below. Google now appear to have decided enough is enough:
It has drawn up lists that rank top phone makers by how up-to-date their handsets are, based on security patches and operating system versions, according to people familiar with the matter. Google shared this list with Android partners earlier this year. It has discussed making it public to highlight proactive manufacturers and shame tardy vendors through omission from the list
- A key practical obstacle they will face is that most Android OEMs are already in a very tough business and it’s far from clear that shaming them on releases will make any difference to their behaviour. Google may find that offering to provide specialist update engineering support to cash-strapped manufacturers would go down better.
- A factor that will compound OEM belligerence is that evidence is emerging that Android Wear watches are not proving to be a compelling product proposition in spite of Google’s major v2.0 update of the platform announced at IO. At least that seems to be the conclusion Samsung have reached in apparently opting to switch to Tizen instead for various reasons including, interestingly, battery performance.
- In the light of the above, it will be interesting to see if the current tranche of OEMs express any genuine interest in getting involved in manufacturing components for Project Ara, Google’s modular smartphone initiative which is continuing to generate considerable analyst interest following IO.
- Still, at least Google can be pleased and relieved with the positive conclusion of the Oracle Android fair use legal case and said as much in their public response:
“Today’s verdict that Android makes fair use of Java APIs represents a win for the Android ecosystem, for the Java programming community, and for software developers who rely on open and free programming languages to build innovative consumer products”
- Google also announced a free cloud-based tool called Data Studio 360 intended to help users with friendly visualisation support. It could prove a significant competitor for the likes of Tableau and Qlik in due course both of which are distinctly non-free:
Asia
- TechInAsia on how WeChat’s wildly successful China recipe has largely failed to translate to the rest of the world. They hired Lionel Messi to spread the word and the normally reliable striker failed to deliver:
Of the 15 countries where WeChat’s Messi ad was aimed at – Argentina, Brazil, Hong Kong, India, Indonesia, Italy, Malaysia, Mexico, Nigeria, the Philippines, Singapore, South Africa, Spain, Thailand, and Turkey, points out Ad Age – the results look grim for the company. In not one single country does WeChat appear to be ahead of WhatsApp or Facebook.
- Luckily, thus far, so has the Chinese model of internet sovereignty, though it has been a ‘scary’ success across China. For WashPo, this brutal reality offers a sobering Orwellian lesson for techno-utopians:
“The Internet is as much a tool for control, surveillance and commercial considerations as it is for empowerment.”
- The Indian concept of jugaad or ‘frugal innovation’ has been highlighted before. It has reached its ultimate incarnation in the reusable mini space shuttle which may give SpaceX et al a run for their money in terms of costs. Perhaps future commercial space projects will get outsourced to India, not just the software systems behind them:
The Internet of Things
- Sticking with jugaad, The Economist on the rise of ever more deadly ‘agile’ weaponry and how Syria is proving an ideal ground for prototyping lethal cheap propositions.
- Passive WiFi could become “a new standard for the Internet of Things“.
Engineering students and professors at the University of Washington have developed a way for passive IoT devices to receive energy via an RF carrier wave transmitted by an active WiFi power source. One central power source plugged into a wall outlet can transmit energy wirelessly to the passive devices, allowing sensors to send data without the need for power-hungry RF circuitry onboard.
- It was widely reported this week that the Raspberry Pi3 is “to become an officially supported device” for Android (AOSP). In actual fact however, the story is a classic fuss about nothing for now because at the time of writing that’s all there is in the corresponding source repository:
Apps and Services
- Trello is a much-loved hobby project management tool. It’s also a serious potential business with over 1.1 million DAU and a footprint in “over 70% of the largest companies in the US“ but remarkably few of them paying anything for using the tool. The combination of Trello and Slack is a classic startup bootstrap combination.
- FastCoCreate on “how to make chatbots that are actually worth talking to“. Giving them personalities, collecting the data and keeping the engagement fun are among key recommendations.
- Business Insider’s article on Dropbox’s epic hosting change highlights the technically complex and largely unseen and unsung work involved in a comprehensive ‘lift and shift’ of infrastructure hosting provider. Few shifts can beat the Dropbox one for scale – their move from AWS to their own DC was one necessitated by their gargantuan global scale and storage appetite. Dropbox built their own equivalent of AWS S3 called Magic Pocket as part of the transition. And apparently they had a ‘no mistakes’ policy on this project. If you’re interested in more background, this Wired article written by Cade Metz on the Dropbox ‘exodus’ is worth checking out.
Anti Mobile
- Evidence that excessive phone use is seriously harmful to your ability to concentrate and may even be responsible for inducing ADHD-like symptoms comes in this Medium post from Tristan Harris who apparently did a stint as “Product Philosopher” at Google. That sounds like an intriguing job. Wonder how much practical remit came with that territory. The article is full of useful tips on how to turn your phone into a genuine minimum viable product:
- Minimize Compulsive Checking & Phantom Buzzes
- Minimize Fear of Missing Something Important
- Minimize Unconscious Use
- Minimize “Leaky” Interactions (“leaking out” into something unintended)
- Minimize Unnecessary Psychological Concerns generated by the screen.
- Relatedly, a blogger in the Guardian suggests that focussing on taking and uploading images of an event rather than experiencing the event per se essentially constitutes contributing to one’s obituary rather than living. Many of us seem conditioned to believe that this simulacrum is somehow ‘even better than the real thing’ as Umberto Eco referred to it in Travels in Hyperreality. Whenever this topic comes up (as it regularly does) I am often transported back to Hyde Park in 1997 and the sight of thousands of people glued to the giant screens showing Princess Diana’s funeral cortege travelling up Park Lane seemingly oblivious to the fact that the real thing was trundling past a short distance to their right. It felt shocking then and would have been inconceivable for previous generations but it seems entirely accepted now to behave in this way.
- An all-glass iPhone is in the offing. Besides the obvious irony of one of the most notoriously secretive tech companies producing the ultimate transparent device, it’s not clear (see what I did there) what the user benefit of a glass phone is.
- Microsoft meanwhile gave up all pretence of claiming Windows Phone was still a viable platform with an announcement that was the equivalent of taking an axe to their mobile ambitions in the shape of a $950 million writedown. It’s the final chapter of an acquisition that went wrong in a very big way, resulting in pretty much a total wipeout of all Nokia employees within 4 years of their much trumpeted move to Microsoft. The full price will be disproportionately paid by Finland’s reeling tech sector for years to come:
Nokia and Microsoft have slashed thousands of Finnish jobs over the past decade, and the lack of substitute jobs is the main reason for the country’s current economic stagnation.
- As OEMs continue to churn out new devices, an ever-growing baggage trail of legacy phones is left in their wake. Dietrich Ayala, an engineer working at Firefox, outlines his neat solution for repurposing these legacy devices by turning them into IoT sensors with a combination of IFTTT and HTML5:
Software and Maths
- Interesting JavaScript hacks include one that spins an ASCII globe:
- Why there is no such thing as a “small change” when it comes to software. One to cut out and keep above your desk:
Agreeing to features is deceptively easy. Coding them rarely is. Maintaining them can be a nightmare. When you’re striving for quality, there are no small changes.
- And yet the one thing you can be sure of with software is that everything absolutely will change around you:
I felt like saying this. pic.twitter.com/mHJ1rENoX1
— Hisham (@hisham_hm) December 13, 2015
- The mutability of software stands in stark contrast with the immutability of mathematics. This excellent post explains why it is so important to keep your mathematical skills fresh irrespective of the type of coding you do in the day job. All the more so in the current development landscape:
software development is quickly shapeshifting. If you discount mathematics, and in turn focus on learning transitory programming tools, you’ll be left without the skills necessary to adapt to emerging computer science concepts that have already started infiltrating engineering teams today. … In the next 10 years, software engineers aren’t still going to be limited to programming web and mobile apps. They’ll be working on writing mainstream computer vision and virtual reality apps, working with interesting cryptographic algorithms for security and building amazing self-learning products using machine learning. You can’t go very far in any of these fields without a solid mathematical foundation.
- Going back to basics, here’s an interesting presentation of a more intuitive notation for power/root/log relationships built around the “triangle of power” with the three numbers involved in the relationship at the vertices and the edges between them defining their specific relationships:
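For reference, the three conventional notations that the triangle unifies are all statements of one relationship between a base, an exponent and a result. The symbol names $x$, $y$ and $z$ are chosen here purely for illustration, and the equivalences assume a positive base $x \neq 1$:

```latex
x^{y} = z
\quad\Longleftrightarrow\quad
\sqrt[y]{z} = x
\quad\Longleftrightarrow\quad
\log_{x} z = y,
\qquad \text{e.g. } 2^{3} = 8,\quad \sqrt[3]{8} = 2,\quad \log_{2} 8 = 3 .
```

Each form simply leaves a different one of the three numbers as the unknown, which is the point the triangle notation makes visually.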
Work and wellbeing
- Illuminating and thought-provoking view of the key takeaways after six months of “experimenting with my life” leads the author to a Thoreau-like transcendentalist conclusion:
The biggest takeaway for me, is learning … that in order for me to truly feel the pulse of life, to experience that fullness of life, I just have to keep treading into the unknown, to keep letting go of what I know, in order to come close to knowing life for what it is.
a recently published study of 16,426 working adults in Norway found that those with workaholism are significantly more likely to have psychiatric symptoms.
Authoritarianism
- The unmasking of Peter Thiel as the hidden hand behind an apparent systematic campaign to destroy Gawker Media disturbed many in Silicon Valley:
What is most frightening about this lawsuit is that the press has always played a significant role in defending the small and powerless against the big and powerful. Gawker has played this role in its own tabloid style, but Thiel’s funding of this lawsuit shows how money can protect that power through third-party litigation funding.
- And led some to question his suitability to be a member of the Facebook Board of Directors. Doubly so given his position as a Trump delegate:
Thiel’s secret laundering of the Gawker lawsuit disqualifies him as someone who should be on a board of directors of any organization that claims to value freedom of expression. Facebook’s other directors, employees, and users should ask how much they want to be associated with a company that keeps someone like Thiel in a position of such power and influence.
- Swirling underneath it all is the question of whether Trump’s politics constitutes fascism or not. The NYT summarized the current opinion landscape on that which includes this attempt to describe what’s going on without invoking Hitler:
“It seems to me in developed and semi-developed countries there is emerging a new kind of politics for which maybe the best taxonomic category would be right-wing populist nationalism”
- The New Yorker’s Adam Gopnik has been painting a progressively dire vision of the US under President Trump over the last few months. His latest article is his most urgent yet urging US voters not to accept “a declared enemy of the liberal constitutional order of the United States”. One suspects that he’s preaching to the converted with those that need converting simply not able or willing to listen:
If Trump came to power, there is a decent chance that the American experiment would be over. This is not a hyperbolic prediction; it is not a hysterical prediction; it is simply a candid reading of what history tells us happens in countries with leaders like Trump. Countries don’t really recover from being taken over by unstable authoritarian nationalists of any political bent, left or right—not by Peróns or Castros or Putins or Francos or Lenins or fill in the blanks. The nation may survive, but the wound to hope and order will never fully heal.
- Mike Godwin, author of the infamous eponymous Law of Hitler Comparisons, feels compelled to clarify his most famous meme following its dramatic invocation by two previous London Mayors as well as a range of assorted anti-Trump commentators:
the purpose of Godwin’s Law was never to be predictive — instead, I designed the law to create a disincentive for frivolous or reflexive Hitler or Nazi comparisons so that, when we do feel compelled to make them in our arguments, we are more likely to be mindful about them.