Sitecore Content Tagging & Azure Text Analytics
To implicitly profile and understand site users, the more we know about our content the better. By "the more we know", I really mean the taxonomy and metadata associated with our content, which we can analyse to draw educated conclusions about the kind of content our site users are interested in and the tasks they are trying to accomplish.
In the best-case scenario, all of our site content would be nicely tagged with relevant, well-designed metadata and controlled and uncontrolled vocabulary, and organised within a carefully designed and effective information architecture. But as we know, that is often not the case. It is simply too much of a hands-on, ongoing effort for most teams, and in a lot of cases content is also produced by third-party agencies or freelancers who may not be as dialled in to the nuances of the data model and the entities that are important to the organisation.
As developers, this is an opportunity to put our thinking caps on to see where technology can help.
All three major cloud providers (Google, Amazon, Microsoft) provide machine-learning-based content tagging services, and they all look pretty similar. The premise is this: give us your unstructured content, we'll analyse it and send you back structured data in the form of sentiment analysis, recognised entities (people, places, organisations, objects), recognised topics, key phrases and so on. The focus of this post is on Microsoft's Text Analytics service, which is part of Cognitive Services.
The fastest way to get started and see what Text Analytics can do for you is to visit the home page and provide the demo with some of your content.
But I'm going to jump straight into how we can integrate this service with Sitecore. The good news is that Microsoft offers a free tier for Text Analytics. Assuming you have an Azure account (if not, sign up for a free trial), follow the documentation to create a Text Analytics resource in your Azure subscription. This gives you a Text Analytics service endpoint for your region and the API keys you need to work with the service.
I'm going to integrate Text Analytics by subscribing to the publish begin event, so that as editorial teams change and publish content, it is automatically tagged before it reaches the content delivery roles.
First I add a handler to the publish begin event. In the handler, I extract the item being published and then call an Item extension method, which takes care of the rest:
To explain: Azure Text Analytics has separate APIs for extracting linked entities and key phrases from a piece of text. For the item being published, we call both of these APIs and store each response in its own field on the item.
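As a rough illustration of what comes back from the two APIs, the sketch below flattens a key phrase response and a linked entity response into delimited strings of the kind that could be stored in two separate fields. It is written in Python purely to show the data shapes; it is not the C# code from the module. The sample payloads are hand-written to follow the v2.0 response format, and the function names and the pipe delimiter are assumptions made for the example.

```python
import json

# Hand-written sample responses following the Text Analytics v2.0 shape.
# The values are illustrative, not real service output.
KEY_PHRASES_RESPONSE = json.loads("""
{
  "documents": [
    {"id": "page", "keyPhrases": ["content tagging", "Sitecore"]},
    {"id": "component-1", "keyPhrases": ["Azure", "content tagging"]}
  ],
  "errors": []
}
""")

ENTITIES_RESPONSE = json.loads("""
{
  "documents": [
    {"id": "page", "entities": [
      {"name": "Sitecore", "wikipediaUrl": "https://en.wikipedia.org/wiki/Sitecore"}
    ]},
    {"id": "component-1", "entities": [
      {"name": "Microsoft Azure", "wikipediaUrl": "https://en.wikipedia.org/wiki/Microsoft_Azure"}
    ]}
  ],
  "errors": []
}
""")


def unique_in_order(values):
    """Drop duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


def key_phrases_field(response):
    """Merge the key phrases of every document into one pipe-delimited value."""
    phrases = [p for doc in response["documents"] for p in doc["keyPhrases"]]
    return "|".join(unique_in_order(phrases))


def entities_field(response):
    """Merge the linked entity names of every document into one pipe-delimited value."""
    names = [e["name"] for doc in response["documents"] for e in doc["entities"]]
    return "|".join(unique_in_order(names))
```

Keeping the two values apart mirrors the two-field approach described above: key phrases are free text, whereas linked entities resolve to known things and can be treated more like a controlled vocabulary.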
A couple of things to note:
- Before sending text to Text Analytics we must convert it to plain text, so rich text content needs converting first. Handily, Sitecore has a utility method to do just this: TextUtil.StripHtml
- We can send up to 1000 documents at a time for analysis. You may choose to bundle fields on the item and selected fields on component data sources into a single document, or send fields on the item and fields on the data sources as separate documents; it depends on how you structure your content. In either case, 1000 documents should be more than enough
- Each document you send to Text Analytics counts as a transaction, and the free tier allows up to 5000 transactions every 30 days. So think about this when choosing how to bundle your content
- The maximum length of an individual document is 5000 characters
- Text Analytics works better with larger volumes of text, so take this into consideration when thinking about how you want to package up your content into documents
- Text Analytics currently works best with English, so expect better results with English content. You can send non-English content for analysis, but you will need to map Sitecore's culture codes to the codes expected by Azure
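Two of the notes above, the 1000 document limit per request and the culture code mapping, are easy to sketch. The following is a minimal Python illustration rather than anything from the module itself. The mapping table, the fallback to the culture prefix and all function names are assumptions; check the language codes Azure actually supports before relying on the prefix fallback.

```python
MAX_DOCUMENTS_PER_REQUEST = 1000  # limit quoted in the notes above

# Hypothetical mapping from Sitecore culture codes to the codes sent to Azure.
CULTURE_TO_AZURE_LANGUAGE = {
    "en": "en",
    "en-GB": "en",
    "en-US": "en",
    "de-DE": "de",
    "fr-FR": "fr",
}


def azure_language(culture, default="en"):
    """Map a culture code such as 'en-GB' to a language code.

    Unknown cultures fall back to their prefix ('da-DK' -> 'da'), which is
    an assumption; an empty culture falls back to the default.
    """
    if culture in CULTURE_TO_AZURE_LANGUAGE:
        return CULTURE_TO_AZURE_LANGUAGE[culture]
    prefix = culture.split("-")[0].lower()
    return prefix or default


def batches(documents, size=MAX_DOCUMENTS_PER_REQUEST):
    """Split a list of documents into request-sized chunks."""
    return [documents[i:i + size] for i in range(0, len(documents), size)]


def transactions_used(documents):
    """Each document counts as one transaction, however it is batched."""
    return len(documents)
```

Since transactions are counted per document rather than per request, batching affects the number of HTTP calls but not how quickly the free tier allowance is used up.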
To take a closer look at one of these APIs:
The first thing it does is build a web request using the RestSharp library, available via NuGet. This pulls things like the Text Analytics base URL, API keys and key phrase analysis endpoint from include file settings. It then builds an object (TextAnalysisRequest) that can be serialized to JSON in the format Azure Text Analytics expects. This object contains the content we want to analyse, which is a combination of item-level field values and selected field values from selected renderings added to the final layout of the item (components). In this example, each component is sent as a separate document:
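To make the request format concrete, here is a minimal Python sketch of the same idea: a POST to the key phrase endpoint with a JSON body holding a list of documents, each with an id, a language and the text. It uses Python's standard library instead of RestSharp and only builds the request without sending it. The region in the base URL, the placeholder key, the document ids and the function name are all assumptions; the body shape and the subscription key header follow the v2.0 REST API.

```python
import json
import urllib.request

# Placeholder settings; in the module these come from include file settings.
BASE_URL = "https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.0"
API_KEY = "<your-api-key>"


def build_request(endpoint, documents):
    """Build (but do not send) a POST request for a Text Analytics endpoint."""
    body = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{endpoint}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": API_KEY,
        },
        method="POST",
    )


# One document for the page itself and one per component.
documents = [
    {"id": "page", "language": "en", "text": "Page level field values."},
    {"id": "component-1", "language": "en", "text": "Text from one component."},
]
request = build_request("keyPhrases", documents)
# Sending it would be: urllib.request.urlopen(request)
```

The ids are echoed back in the response, which is what lets each result be matched to the page or component it came from.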
There's a bunch of hard-coded GUIDs in here (NEVER DO THIS!!!), which is partly why this code's not on GitHub yet, but I will sort that out soon and make the whole thing available to clone/download. It's based on Helix architecture, so it should be OK to include in any solution. The concept is fairly straightforward: given the page item being published and the data source items associated with selected renderings on the page item (found using the ExtractRenderings extension method), get the fields that map to the item's template (the fields we want to send for analysis), concatenate them together, strip out the HTML and return the text, up to 5000 characters.
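The concatenate, strip and truncate step can be sketched in a few lines. This is a Python illustration of the concept only, not the module's C# code: the regex-based tag stripper is a crude stand-in for Sitecore's HTML-stripping utility, and the field names and function names are hypothetical.

```python
import html
import re

MAX_DOCUMENT_LENGTH = 5000  # per-document limit quoted in this post


def strip_html(markup):
    """Crude tag stripper standing in for Sitecore's HTML-stripping utility."""
    text = re.sub(r"<[^>]+>", " ", markup)    # replace tags with spaces
    text = html.unescape(text)                # decode entities such as &amp;
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def build_document_text(fields, field_names, limit=MAX_DOCUMENT_LENGTH):
    """Concatenate the chosen fields, strip HTML and cap the length."""
    values = [fields.get(name, "") for name in field_names]
    combined = " ".join(v for v in values if v)
    return strip_html(combined)[:limit]


# Hypothetical field values for a page item; only Title and Body are sent.
page_fields = {
    "Title": "Content Tagging",
    "Body": "<p>Tag your <strong>content</strong> &amp; profile users.</p>",
    "Internal Notes": "not sent",
}
text = build_document_text(page_fields, ["Title", "Body"])
```

Truncating after stripping the HTML, rather than before, means the character budget is spent on readable text instead of markup.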
So now every page that gets published is being sent off to Cognitive Services and its machine learning based algorithms to analyse the content and automatically tag items. This data can then be used to drive personalisation, automatic profiling (see my Profile Mapping module) or in any way you see fit for your needs.
I haven't spent a whole bunch of time looking at it, but Sitecore 9.1 comes with an out-of-the-box integration with Open Calais, which includes content tagging abilities. As soon as I get some time, I will look deeper into that and report back with findings.