In part 1 of this series, Check your site for broken links in SharePoint Online, I looked at going through all the sites within a site collection.
In this post I'm continuing with the implementation of the Get-WebForBrokenLinks function.
[code lang=text]
Get-WebForBrokenLinks -Web $subweb
[/code]
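In case you don't have part 1 to hand: the call above assumes a connected PnP PowerShell session and a collection of subwebs, along these lines (just a sketch; the site URL is a placeholder and cmdlet names differ slightly between PnP PowerShell versions):
[code lang=text]
# Sketch: connect and collect all subwebs, roughly as in part 1 of this series
Connect-PnPOnline -Url "https://yourtenant.sharepoint.com/sites/yoursite" -Credentials (Get-Credential)
$subwebs = Get-PnPSubWebs -Recurse
foreach ($subweb in $subwebs) {
    Get-WebForBrokenLinks -Web $subweb
}
[/code]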
Before we can have a look at finding all broken links within a site, we need to identify where broken links may be stored. A quick look at SharePoint gives me the following locations:
- List items
- Pages
- Documents in libraries
- Web Parts
For now I'm going to look at the easiest option: list items.
I'm going to start with the function, making the lists available using Load and ExecuteQuery:
[code lang=text]
Function Get-WebForBrokenLinks {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$True, ValueFromPipeline=$True, ValueFromPipelineByPropertyName=$True, HelpMessage='Web to be scanned for broken links')]
        [Microsoft.SharePoint.Client.Web] $Web
    )
    begin {
        Write-Host "Scanning: " $Web.Url
    }
    process {
        # Load the lists collection for this web
        $Web.Context.Load($Web.Lists)
        $Web.Context.ExecuteQuery()
        ... # This is where the rest of the code needs to appear
    }
    end {
        Write-Host "Completed scanning: " $Web.Url
    }
}
[/code]
Now I need to go through the lists and the list items:
[code lang=text]
ForEach ($list in $Web.Lists) {
    # Get all the items in the list (using PnP PowerShell)
    $items = Get-PnPListItem -List $list
    foreach ($item in $items) {
        ....
    }
}
[/code]
So now I'm getting the items for all of my lists. It now becomes important to understand what types of fields SharePoint has, as we step through all the fields in all the items of all the lists.
[code lang=text]
foreach ($fieldValue in $item.FieldValues) {
    foreach ($value in $fieldValue.Values) {
        if ($null -ne $value) {
            switch ($value.GetType().Name) {
                ....
            }
        }
    }
}
[/code]
Now all we need to do is handle all the data types that may contain URLs. So what are the data types, and which ones could possibly contain a URL?
To find out, I added a default option to my switch:
[code lang=text]
default {
    $type = $value.GetType()
    Write-Error "Not supported type: $type"
}
[/code]
Then I kept rerunning my script until I had collected all the data types. I found the following data types in my lists:
- Guid
- Int32
- ContentTypeId
- DateTime
- FieldUserValue
- FieldLookupValue
- Boolean
- Double
- String[]
- FieldUrlValue
- String
Most of these couldn't possibly contain a URL, e.g. a Guid. Building up my switch, I get the following script:
[code lang=text]
switch ($value.GetType().Name) {
    "Guid"             { } # Ignore
    "Int32"            { } # Ignore
    "ContentTypeId"    { } # Ignore
    "DateTime"         { } # Ignore
    "FieldUserValue"   { } # Ignore
    "FieldLookupValue" { } # Ignore
    "Boolean"          { } # Ignore
    "Double"           { } # Ignore
    "String[]" {
        ...
    }
    "FieldUrlValue" {
        ...
    }
    "String" {
        ...
    }
    default {
        $type = $value.GetType()
        Write-Error "Not supported type: $type"
    }
}
[/code]
OK, so far I only need to write code for three field types. I'm going to start with FieldUrlValue. The reason this type is easier than String is that a String field may contain other text as well:
[code lang=text]
if ($value.Url.Contains("https://") -or $value.Url.Contains("http://")) {
    try {
        # A HEAD request is enough; anything other than a 200 means the link is broken
        if ((Invoke-WebRequest $value.Url -DisableKeepAlive -UseBasicParsing -Method Head).StatusCode -ne 200) {
            Write-Host "Broken link:" $value.Url
        }
    }
    catch {
        Write-Host "Broken link:" $value.Url
    }
}
[/code]
So we are now ready to answer the next critical question: how do I recognize a URL in text? I've seen solutions using regular expressions, and although that might be a good way (if you can get it to work!), I'm hoping I have found an easier way.
It all started by assuming that a URL doesn't contain a space. So if I have text containing a URL, a split by space gives me an array:
[code lang=text]
$string = "text https://sharepains.com/anylocation/anypage.html some more text"
$string.Split(" ")
[/code]
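To illustrate, the words that look like URLs can then be picked out of that array (just a quick sketch, not part of the final script):
[code lang=text]
# Keep only the words that contain a protocol prefix
$string = "text https://sharepains.com/anylocation/anypage.html some more text"
$string.Split(" ") | Where-Object { $_.Contains("http://") -or $_.Contains("https://") }
# -> https://sharepains.com/anylocation/anypage.html
[/code]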
OK, this will almost work, but not if there isn't a space before or after the URL. So other than spaces, what else could separate URLs from text?
I'm first having a look at the HTML:
[code lang=text]
<a href="http://testurl">Link</a>
[/code]
As all I'm interested in is getting a variable with a clean URL in it, I can simply split by the quote character (") as well.
For String fields this results in the following piece of code:
[code lang=text]
if ($value.Contains("https://") -or $value.Contains("http://")) {
    try {
        # First split the text into words
        $words = $value.Split(" ")
        foreach ($word in $words) {
            # Then split each word on the quote character to strip HTML attribute markup
            $quotesplitwords = $word.Split("`"")
            foreach ($quotesplitword in $quotesplitwords) {
                if ($quotesplitword.Contains("https://") -or $quotesplitword.Contains("http://")) {
                    if ((Invoke-WebRequest $quotesplitword -DisableKeepAlive -UseBasicParsing -Method Head).StatusCode -ne 200) {
                        Write-Host "Broken link:" $quotesplitword
                    }
                }
            }
        }
    }
    catch {
        Write-Host "Broken link:" $quotesplitword
    }
}
[/code]
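Since the same HEAD-request check is now repeated for the FieldUrlValue and String cases, it could be factored into a small helper. This is an optional sketch; the name Test-LinkBroken is my own and not part of the script above:
[code lang=text]
Function Test-LinkBroken {
    param(
        [Parameter(Mandatory=$True)]
        [string] $Url
    )
    try {
        # A HEAD request is enough; anything other than a 200 counts as broken
        $response = Invoke-WebRequest $Url -DisableKeepAlive -UseBasicParsing -Method Head
        return ($response.StatusCode -ne 200)
    }
    catch {
        # DNS failures, 404s and other errors end up here
        return $True
    }
}

# Usage:
if (Test-LinkBroken -Url $quotesplitword) {
    Write-Host "Broken link:" $quotesplitword
}
[/code]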
This code now gives only one kind of false positive: if a URL appears in text without being an actual clickable hyperlink, the script will still check it and may flag it up. In effect, any text fragment containing http gets checked, and is reported as a broken link if the request fails. For now I'm going to live with that, although I'm not sure whether this will be OK for the remaining locations that may contain broken links.
So this now covers finding broken URLs within list items. There is still quite a bit of work to do:
- Pages
- Documents in libraries
- Web Parts
But these elements will be covered in the next part of this series. Now that we have code that finds URLs within text, we are halfway there.
Can’t wait for part III!
Sorry, I had a client that needed this a few years ago, but we never got to fully implement the whole solution. If I get a client who does want the rest as well, then I might complete this series.
Using Invoke-WebRequest, I am able to find broken SP sites, but for site pages it is still returning a 200 status for broken pages. Any suggestions?
You could try to read the page and look at the content. A non-existing page will not have any content.
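For what it's worth, a rough sketch of that idea (the $pageUrl variable and the 500-character threshold are placeholders you would need to adjust for your tenant):
[code lang=text]
# Sketch: a page that returns 200 but has (almost) no content is probably broken
try {
    $response = Invoke-WebRequest $pageUrl -DisableKeepAlive -UseBasicParsing
    if ($response.StatusCode -eq 200 -and $response.Content.Length -lt 500) {
        Write-Host "Possibly broken page:" $pageUrl
    }
}
catch {
    Write-Host "Broken page:" $pageUrl
}
[/code]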
I guess there was no part 3? This definitely got me started, and I got it all sorted out for the most part now, but I found that the lists were harder to filter through than regular pages. Especially fields that were multi-line or allowed the special features. No matter what you do, the text comes out with multiple lines and won't filter properly. I ended up making a scratch file to dump the text to and then doing a Get-Content … -Raw.
Indeed, I never got to finish this series as the client didn't want this to be done.