As SEO experts, we use many tools on a daily basis, and each tool gives us different outputs. Crawling tools are undoubtedly the most important among these, because we can run a crawl on specific pages or an entire website and quickly detect technical problems or deficiencies from the output the tool provides. Some of these crawling tools are desktop applications, while others are cloud-based. Desktop tools have to be downloaded to your computer and use your computer's hardware while crawling. Cloud-based crawling tools like Deepcrawl, on the other hand, don't need to be installed: you simply go to deepcrawl.com and log in with your username and password. Your computer's hardware is not used while running a crawl with Deepcrawl and similar tools; the process takes place entirely on Deepcrawl's servers.

In this article, we will try to cover all the important details about Deepcrawl. Let's begin with how to start a crawl for your website.

How to Start a Crawl Using Deepcrawl

After logging in to deepcrawl.com with your username or e-mail address and password, you will see the following screen. To start a crawl, you must first click on the "New Project" button in the upper-right part of the page.

Here you will see the "Domain" screen, the first step of the crawl setup. Enter the domain name in the first field and the project name in the second one. Then, if necessary, you can also enable the JavaScript rendering setting. After completing these steps, click the "Save and Continue" button.

Then you will see the second part, the "Sources" screen. I will try to explain the settings here and what they do one by one.

1) Website: Here, you can choose to crawl all subdomains, as well as both HTTP and HTTPS pages.

2) Sitemaps: After you enter the domain on the first screen, Deepcrawl finds your website's sitemaps. Make sure the sitemaps are selected so that no URL is overlooked. You can also upload a different sitemap here or include any sitemap you want.

3) Backlinks: Here, you can use the Majestic integration to include the URLs that receive backlinks in the crawl, or you can upload a manual backlink list. If you add this source to the crawl, some reports will also contain backlink data.

4) Google Search Console: Here, you can include Search Console data in the crawl by selecting your Search Console property. This minimizes the chance that pages go unnoticed because they are not linked internally or because Deepcrawl's bots cannot reach them for some reason. If you cannot see your properties in this area, you can connect the Google account that holds your Search Console property in the upper-right section of the page.

5) Analytics: You can also select your Analytics account here. This allows Deepcrawl's bots to discover pages that are not linked from within your site. In addition, some reports produced by the crawl will also contain your Analytics data.

6) Log Summary: In this section, you can use the summary data of log file analysis tools such as Splunk and Logz.io, or manually upload your log files and include that data in the crawl.

7) URL Lists: Lastly, in this section, you can upload a manual list containing the URLs you want to be crawled.

Thus, we have gone over all the settings that determine which sources will be included in the crawl. Once you adjust the necessary settings in this section, click the "Save and Continue" button at the bottom of the page again.

This will bring you to the third section, the "Limits" screen. Here, you can specify the number of URLs to be crawled per second and the maximum total number of URLs to be crawled. In the subsection below, you can choose what happens when that limit is reached: receive a notification, or let the crawl finish regardless. After adjusting the settings here, click the "Save and Continue" button once again.

This brings you to the fourth and final section, the "Settings" screen. So far, we have made several important general adjustments. If you also want to view the advanced settings, click the "Advanced Settings" button. Here, you can adjust several detailed settings, including additional domains or subdomains to include in the crawl, URL paths to include in or exclude from the crawl, and the user agent used to run the crawl. Once you have made all the adjustments, click the "Start Crawl" button to initiate the crawl.

After starting the crawl, you will see the following screen. On this screen, you can see the number of URLs crawled in real-time, and if you wish, you can pause, stop, or delete the crawl. If you are not going to do one of these, you do not need to wait on this screen. Once the crawl is complete, Deepcrawl will send you a notification e-mail.

Deepcrawl Dashboard

Once the crawl is complete, you will see the Dashboard screen shown below. Let's briefly go over the sections on this screen.

Here you can see your site's primary problems, such as broken links, pages that are not linked from anywhere, and pages with 4xx status codes. When you click on any of these items, you will see the affected URLs and the details of that problem.

In this section, you can see basic details such as your primary pages, duplicate pages, pages with or without the status code 200, pages that cannot be indexed for any reason, and the related pie chart.

If a certain period of time has passed since your crawl and you want to see the changes on your site, you can start a new crawl by clicking the "Run Crawl" button in the "Changes" section.

As you know, every page we want to be indexed must return a 200 status code (200 pages). In this section, you can see the pages that do not return a 200 status code (non-200 pages).

In this section, you can see the URLs that could not be crawled, along with the reason why.

We may want some pages not to be indexed, for example to optimize the crawl budget. In this section, you can see your non-indexable pages along with the reason they cannot be indexed. I recommend reviewing the non-indexable pages here; if you see that important pages cannot be indexed because of a problem, you should intervene quickly.

In this section, you can see the pages that are not linked from any other page on your site, known as "orphaned" pages. Once again, I recommend reviewing the pages here in detail. If there are important pages here, make sure they are linked from your other pages. If the pages here have simply gone unnoticed and are unimportant, you can consider options such as redirection.

In this section, you can check whether your pages appear in the specified sources. For example, in the section shown in the image below, you can see which pages appear in Search Console, Analytics, the sitemaps, and the web crawl, and which do not. To view Search Console, Analytics, or backlink data here, you need to set up those connections when starting the crawl.

In this section, you can see duplicate pages, non-200 pages, pages that cannot be indexed, etc., and trend graphs of your pages.

Within this section, you can access the depth-level graph of your pages. This lets you quickly see at which depth levels most of your pages sit, as well as spot pages buried at the deepest levels that may have gone unnoticed.

In this section, you can see the number of your HTTP and HTTPS pages. If you have HTTP pages, you need to ensure that all your pages are opened securely as HTTPS.

Here, you can see your indexable pages that get or do not get impressions according to Search Console data. In order to see the data here, you need to establish the necessary Search Console connection when starting the crawl.

In this section, you can view the click-through rate of your indexable and non-indexable pages per device. In order to see the data here, once again you need to establish the necessary Search Console connection when starting the crawl.

That covers the sections of the Dashboard screen you will see first once the crawl is complete. Now, let's talk about the important sections in the menu on the left that you should definitely review.

- Issues

In this section titled "Summary", you can view the list of all the main problems that come out after crawling your pages.

- Changes

In this section, you can view the changes related to the problems on your web pages if you have run a crawl before.

- All Pages

Here, you can access the two basic tables from the Dashboard screen and the list of all your pages. By clicking the relevant options in these tables, you can easily filter your pages by basic attributes or problems. In addition, you can easily export the full list of crawled pages.

- Indexable Pages

In this section, you can view all your indexable pages. You can also access your unique and duplicate pages among your indexable pages in the pie chart on the right.

- Non-Indexable Pages

In this section, you can access the pages that cannot be indexed for any reason. The graph on the left also shows why these pages cannot be indexed. I recommend reviewing the pages here in detail; otherwise an important page or page group may remain unindexed because of an overlooked problem. In the sample crawl below, we see that all of the non-indexable pages cannot be indexed because their canonical tags point to different pages. As you know, every page you want to be indexed needs to refer to itself with the canonical tag.

- 200 Pages

In this section, you can access your pages with the status code 200. As you know, all our important pages need to be served from the server seamlessly and with the status code 200. The graph on the right indicates the number of your pages with the status code 200 based on your previously run crawls.

- Non-200 Pages

Here, you can access your pages that do not return a 200 status code. Every page we want both users and search engine bots to reach must be served with a 200 status code. For this reason, you should review all the non-200 pages shown here in detail. If, apart from redirects, there are pages returning 404 or 500 status codes, you need to resolve the underlying problems and ensure those pages open without issues with a 200 status code.
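If you export the URLs in this report, a small script can help you re-check their current status codes after you deploy fixes. Below is a minimal sketch, not part of Deepcrawl itself, assuming Python with the requests library installed; the URLs are placeholders.

```python
# Re-check the status codes of a list of URLs after fixes are deployed.
# Minimal sketch: assumes `requests` is installed; the URLs are placeholders
# standing in for an exported Deepcrawl list.
import requests

urls = [
    "https://www.example.com/",
    "https://www.example.com/old-page",
]

for url in urls:
    try:
        # allow_redirects=False so we see the original status code (301, 404, 500, ...)
        response = requests.head(url, allow_redirects=False, timeout=10)
        print(f"{response.status_code}  {url}")
    except requests.RequestException as exc:
        print(f"FAILED  {url}  ({exc})")
```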

- Uncrawled URLs

In this section, you can access the URLs that could not be crawled for various reasons, along with the reasons why. In the example below, we see that all of these URLs could not be crawled because of a disallow directive in the robots.txt file. You should review the URLs here and the directives in your robots.txt file; otherwise, some overlooked pages or page groups may not be indexed.
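If you want to double-check whether a specific URL is blocked, Python's standard library ships a robots.txt parser you can use outside of Deepcrawl. A minimal sketch, with the domain and user agent as placeholders:

```python
# Check whether specific URLs are blocked by robots.txt.
# Uses only the Python standard library; the domain and user agent are placeholders.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # downloads and parses the robots.txt file

for url in ["https://www.example.com/", "https://www.example.com/private/page"]:
    allowed = parser.can_fetch("Googlebot", url)
    print(f"{'allowed' if allowed else 'blocked '}  {url}")
```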

- Primary Pages

In this section, you can access the list of your indexable and unique pages. We may say that the pages listed here are important primary pages for your website.

- Duplicate Pages

In this section, you can access pages with duplicate titles, description tags, and the same or substantially similar content. I recommend that you review the pages here, and customize the pages that you expect to get organic traffic in terms of title, description, and content.

- Self Canonicalized Pages

In this section, you can access the pages that refer to themselves with the canonical tag. It is worth repeating that every page to be indexed needs to refer to itself with the canonical tag. Still, it is worth reviewing the pages and page groups in this section: if there are pages you do not want to be indexed or that duplicate other pages, you can change their canonical tags accordingly.
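To spot-check whether a page's canonical tag really points back to the page itself, you can fetch the page and read the tag directly. Below is a minimal sketch, not Deepcrawl's own method, assuming Python with the requests and beautifulsoup4 libraries installed; the URL is a placeholder.

```python
# Fetch a page and check whether its canonical tag points back to the same URL.
# Assumes `requests` and `beautifulsoup4` are installed; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/some-page/"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

canonical = soup.find("link", attrs={"rel": "canonical"})
if canonical is None:
    print("No canonical tag found")
else:
    href = canonical.get("href", "").rstrip("/")
    if href == url.rstrip("/"):
        print("Self-canonicalized:", href)
    else:
        print("Canonical points elsewhere:", href)
```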

- Noindex Pages

In this section, you can access the pages with the "noindex" tag. I recommend reviewing the pages here and removing the "noindex" tag from any overlooked pages that you actually want to be indexed.
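When verifying a specific page yourself, keep in mind that a noindex directive can live either in the meta robots tag or in an X-Robots-Tag HTTP header. Below is a minimal sketch that checks both places, assuming Python with requests and beautifulsoup4; the URL is a placeholder.

```python
# Check both places a noindex directive can live: the X-Robots-Tag HTTP header
# and the meta robots tag in the HTML. The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/some-page/"
response = requests.get(url, timeout=10)

header_value = response.headers.get("X-Robots-Tag", "")
meta = BeautifulSoup(response.text, "html.parser").find("meta", attrs={"name": "robots"})
meta_value = meta.get("content", "") if meta else ""

if "noindex" in header_value.lower() or "noindex" in meta_value.lower():
    print("noindex found:", header_value or meta_value)
else:
    print("No noindex directive found on this page")
```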

- Canonicalized Pages

Here, you can access pages whose canonical tag points to a different URL. Review these pages, and if any of them should be indexed, have them refer to themselves with the canonical tag.

- 301 Redirects

In this section, you can access the pages redirected with a 301 status code and the pages they redirect to.

- Non-301 Redirects

Within this section, you can access pages with 302 status codes. If any of these pages should be redirected permanently, use a 301 redirect instead.

- 5xx Errors

In this section, you can access pages with 5xx status codes. These status codes arise from problems on the server side; you should examine the affected pages, solve the related problems, and ensure that they open with a 200 status code.

- Broken Pages (4xx Errors)

In this section, you can access pages with 4xx status codes. As you know, all our pages that we want users and search engine bots to visit should be opened up without any problems with the status code 200. For this reason, you should solve the problem on pages with 4xx status codes and make the necessary redirects if the relevant pages will not be active again.

- Failed URLs

In this section, you can access failed pages. The cause may be that the relevant page does not respond at all or loads too slowly. In the example below, we see that the pages' loading times are too long and they are reported with a "0" status code. The problems on these pages should be fixed so that they open without issues with a 200 status code.

- Content Overview

Here, you can access the graphs and the list of general content-related problems. In addition, if you have run a crawl before, you can view the changes in content-related problems from the side tab.

- Missing Titles

In this section, you can access the pages that are missing a title tag. I recommend adding an appropriate, unique title tag to each of the pages here.

- Short Titles

In this section, you can access pages whose title tags are short in character length. Of course, you don't have to lengthen the title tag of every page here. However, to use this space more efficiently, it may be useful to improve the title tags of certain pages or page groups.

- Max Title Length

In this section, you can access pages with title tags that are longer than necessary. Title tags longer than a certain pixel width appear truncated on results pages. For this reason, I recommend optimizing the length of the title tags of the pages here.

- Pages with Duplicate Titles

In this section, you can access pages with duplicate title tags. I recommend that you add a unique title tag for each of the pages included here, especially for each page that is indexed and where you expect organic traffic.

- Missing Descriptions

In this section, you can access pages that do not have a description tag. Although description tags are not a direct ranking factor, they may affect click-through rates. For this reason, it is useful to add a unique description tag to each page that is indexed and expected to receive organic traffic. Otherwise, Google will display random text from within the page as the description.

- Short Descriptions

Here, you can access pages with description tags that are short in character length. Google usually does not show very short description tags on results pages and instead displays random text from within the page as the description. For this reason, it is useful to optimize the description tags of pages that are indexed and expected to receive organic traffic. Deepcrawl treats description tags shorter than 50 characters as short.

- Max Description Length

In this section, you can access pages with description tags that are longer than necessary. Description tags longer than a certain pixel width appear truncated on results pages. For this reason, I recommend optimizing the length of the description tags of the pages here.
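If you want to spot-check individual pages outside of Deepcrawl, you can measure the character lengths of a page's title and description with a short script. Truncation on results pages is pixel-based, so the character thresholds below are only rough, illustrative values (the 50-character mark is the short-description limit mentioned above). Assumes Python with requests and beautifulsoup4; the URL is a placeholder.

```python
# Quick character-length check for a page's title and meta description.
# Thresholds are rough, illustrative values; truncation in search results is pixel-based.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/some-page/"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

title = soup.title.get_text(strip=True) if soup.title else ""
meta = soup.find("meta", attrs={"name": "description"})
description = meta.get("content", "").strip() if meta else ""

print(f"Title ({len(title)} chars): {title}")
print(f"Description ({len(description)} chars): {description}")

if len(title) > 60:        # illustrative threshold, not Deepcrawl's exact limit
    print("Title may be truncated in search results")
if len(description) < 50:  # Deepcrawl treats descriptions under 50 characters as short
    print("Description is short")
```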

- Pages with Duplicate Descriptions

In this section, you can access the pages with duplicate description tags. It is useful to add a unique description tag for the important pages here.

- Empty Pages

You can access empty pages in this section. If these pages were created by mistake, we do not want users and search engine bots to encounter a blank page. You can redirect the blank pages, or, if a problem is preventing their content from loading, fix it so the pages serve proper content.

- Thin Pages

You can reach pages with thin content via this section. Improving these pages will contribute to a better experience for both users and search engine bots. Especially if you expect organic traffic from these pages, you should make the necessary content improvements.

- Missing H1 Tags

In this section, you can access the pages that don't have an H1 tag. In particular, for each page that is indexed and expected to receive organic traffic, I recommend adding a unique H1 tag that summarizes the page content.

- Multiple H1 Tag Pages

In this section, you can access pages with more than one H1 tag. The H1 tag acts as the main heading of the page and should appear only once per page. For this reason, the pages here should be adjusted so that each has a single H1 tag.
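To spot-check how many H1 tags a page actually contains in its HTML source, you can count them with a parser. A minimal sketch along the same lines as the earlier examples; the URL is a placeholder.

```python
# Count the H1 tags in a page's HTML source. The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/some-page/"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

h1_tags = soup.find_all("h1")
print(f"{len(h1_tags)} H1 tag(s) found")
for h1 in h1_tags:
    print(" -", h1.get_text(strip=True))
```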

- Canonical to Non-200

In this section, you can view pages where the URL referred to by the canonical tag does not have the status code 200. In this case, the relevant URLs referred to with canonical should be opened up without any problems with the status code 200 or the canonical tags should be updated.

- Redirect Chains

In this section, you can access pages with a redirect chain. This means that a page redirects to another page, which in turn redirects to yet another page. In this case, search engine bots send a new request at each hop. To avoid this problem, the redirect chains here should be removed so that each page redirects directly to its final destination.
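You can trace a chain yourself by following a URL's redirects and printing each hop. A minimal sketch using the requests library; the URL is a placeholder.

```python
# Follow a URL's redirects and print every hop so the chain is visible.
# Assumes `requests` is installed; the URL is a placeholder.
import requests

url = "http://example.com/old-page"
response = requests.get(url, allow_redirects=True, timeout=10)

# response.history holds each intermediate redirect response, in order.
for hop in response.history:
    print(f"{hop.status_code}  {hop.url}")
print(f"{response.status_code}  {response.url}  (final destination)")

if len(response.history) > 1:
    print("Redirect chain detected: redirect the first URL straight to the final one.")
```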

- All Redirects

In this section, you can access all redirects, types of redirects, and details about these redirects on the crawled pages.

- All Broken Redirects

In this section, you can access all the broken redirects, where the destination page does not return a 200 status code. The redirect problems here need to be fixed: either remove the redirects or point them to a page that returns a 200 status code.

- HTTP Pages

In this section, you can access the HTTP pages on your site. You must ensure that all pages here are opened up securely as HTTPS.

- Broken Sitemap Links

In this section, you can access the URLs in your sitemap that do not return a 200 status code. As you know, your sitemap should only contain pages that you want to be indexed and that return a 200 status code. The URLs here must be made to return a 200 status code, or they need to be removed from the sitemap.
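After cleaning up your sitemap, you may want to re-verify it outside of Deepcrawl. The sketch below fetches a plain (non-index, non-gzipped) sitemap and reports every URL that does not return a 200 status code; the sitemap URL is a placeholder.

```python
# Fetch an XML sitemap and report every URL that does not return a 200 status code.
# Assumes a plain (non-index, non-gzipped) sitemap; the sitemap URL is a placeholder.
import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "https://www.example.com/sitemap.xml"
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NAMESPACE)]

for url in urls:
    status = requests.head(url, allow_redirects=False, timeout=10).status_code
    if status != 200:
        print(f"{status}  {url}")
```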

- Non-Indexable URLs in Sitemaps

In this section, you can access URLs that are not indexable but are still included in the sitemap. If the URLs here are intentionally non-indexable, they should also be removed from the sitemap.

- Indexable Pages without Search Impressions

You can access pages in this section that are indexable but never get any impressions. In order to see the data here, you need to connect your Search Console account at the beginning of the crawl. It may be useful to make a general review of these pages. If there is a page or page group that does not have the potential to get impressions, it can be prevented from being indexed in order to optimize the crawl budget. In the opposite scenario, necessary improvements should be made in order for these pages to get impressions and clicks.

Conclusion

We have tried to address all the important issues you may encounter after running a crawl with Deepcrawl. While some of these issues are minor technical problems, others are major enough to directly affect your website's SEO performance. For this reason, we recommend crawling your website with Deepcrawl periodically and resolving these technical problems and deficiencies in order of priority.

 

Penned by Metehan Urhan - SEO Executive, Zeo Agency