A year ago I put together a list of the main factors involved in working with large sites and online media. I have now decided to share that experience; perhaps someone will find it useful.
Search engine requirements
- crawl speed
- stable server response
- availability of old documents
- a low error rate (5xx, 4xx)
- unique content
- no duplicates
- document markup
- a mobile version
- document markup (profiles)
- an effective map of URL parameters
- a stable mirror policy
- a moderation policy for UGC (user-generated content)
Questions to ask about your market
- are there any newcomers in the market segment of interest?
- how has the number of entry points changed, and why?
- what are the traffic dynamics per segment?
- if an audience segment is suddenly lost, will one of the competitors pick it up?
- if the number of landing pages (LPs) is doubled, by how much will traffic grow?
- how many landing pages actually deliver benefit?
- how many topics does the content cover, and which bring the most traffic and conversions?
Myths and misconceptions
- a lot of static content = stable traffic
- a large index is more stable
- page load speed can be neglected
- a sitemap solves everything
- a large number of excluded pages is not my problem
- URL structure is at the discretion of the team lead
- on a 502/504 response, the bot will wait
- there is no need to keep an access.log for crawlers
- the semantic core can be expanded without updating the collection
A large site, from the webmaster's point of view
It is a resource with 500 thousand documents in the index, at least 30 thousand pages crawled per day, at least 20 thousand entry points per day, search traffic of 300 thousand and more per day, and weekly traffic swings of around 0.2 million.
But search engines take a different view of large sites:
- A large collection of text documents (example: Ответы@Mail.ru)
- Heavy use of background and noise words (social network content)
- Intersecting narrow subsets of documents (posts on Facebook)
- Restoring the hierarchy of the collection (social network content)
- Sentiment analysis and classification of text documents (reviews, tweets, etc.)
- Categorization, classification, and processing of annotated text documents (large ones, e.g. books with a table of contents)
- Analysis and aggregation of the news stream based on extracted semantic information
- Organizing the fetcher (quotas and rules, example: ^/[^/]*/[^/]*/page\-[0-9]+\.html$)
- Duplicate detection and index cleanup (example: social profiles)
- Machine learning on big data
- Load distribution and daily quotas (mobile version + https)
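The fetcher rule above is a regular expression over URL paths. A minimal sketch of how such a rule might be applied (the example paths and the function name are my own, not from the article):

```python
import re

# Crawl rule quoted in the article: allow only paginated HTML documents
# sitting exactly two directory levels deep.
PAGINATION_RULE = re.compile(r"^/[^/]*/[^/]*/page\-[0-9]+\.html$")

def should_fetch(path: str) -> bool:
    """Return True if the URL path matches the fetcher rule."""
    return PAGINATION_RULE.match(path) is not None

print(should_fetch("/forum/python/page-12.html"))  # matches: two levels deep
print(should_fetch("/forum/page-12.html"))         # rejected: only one level
print(should_fetch("/a/b/page-3.htm"))             # rejected: wrong extension
```

A real fetcher would combine many such allow/deny rules with per-segment quotas, but the matching step is the same.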
From a machine-learning perspective, a large site means:
- It is difficult to determine which topics each document relates to (if there is no RDF or conventional vocabularies such as hCard, hAtom, etc.)
- It is difficult to determine the number of statistically distinguishable topics
- Partial training (fresh generalization) happens on small parts of the collection: the mathematical models are convenient, but the results have extremely weak linguistic justification
- With a large number of topics, relative perplexity (which depends on dictionary size and the distribution of word frequencies across the collection) decreases as the dictionary is thinned out. This is because topics are unequal in size: under random dictionary thinning, small topics become statistically insignificant and cease to be identified
- With a smaller number of topics, relative perplexity increases as the dictionary is thinned out. Presumably this is because the topic model internally merges the main topics; the differences between the merged topics are insignificant, the topics converge and come to resemble the unigram model of the collection.
The difficulty is that the amount of content created on these sites is greater than the amount indexed. There is incomplete moderation of UGC, a large number of duplicates, intersecting types of markup, and pagination and multiple sort orders over a large number of user profiles and social graphs; crawling also has to respect rights to video content (geo-IP restrictions).
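A common approach to the duplicate problem described above is shingle-based near-duplicate detection. A minimal sketch; the shingle length is an illustrative assumption, not a value from the article:

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Split a text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets (0.0 .. 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

Documents whose similarity exceeds some threshold can be collapsed before indexing; production systems usually hash the shingles (e.g. MinHash) instead of comparing the raw sets.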
Features of large sites
Moving to HTTPS, changing mirrors, load speed weighed against crawl speed (night crawls), the stability of query-document pairs (UGC, files, loss from the index), excessive JS + CSS, a lot of boilerplate pages, different page types depending on signed-in/signed-out state, a lot of content, a lot of topics, a lot of competitors, and a price of failure on the order of N million documents.
General recommendations for large sites
- Mandatory markup of documents according to content type
- Prevent indexing of potentially weak documents
- Prevent indexing of documents with a short life cycle
- Track the dynamics of crawlers' daily quotas per site segment
- Do not use new domains (new TLDs)
- Do not use lower-level domains to grow the document collection and traffic
- Monitor the load on DNS servers
- Do not delete documents covered by notices (for example DMCA) that are not pirated content
- Do not create problems with viewing the site on mobile devices
- Minimize similar documents by merging them into thematic clusters
- Do not use slang or typos in annotations and markup
- Keep internal links identical across all versions of the site
- Do not use landing pages with geo-restricted video content
- Do not return HTTP 3xx codes inside documents on mirrors (non-primary hosts)
- Keep a log of all entry points (with the source search engine) and aggregate the data into clusters
- Monitor the dynamics of documents that are indexed but receive no traffic
- Disavow links if they cause problems
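The recommendations on keeping weak and short-lived documents out of the index are usually implemented with robots.txt rules or a robots meta tag. A minimal sketch; the paths are hypothetical, and wildcard support in Disallow is an extension honored by Yandex and Google rather than part of the original robots.txt standard:

```
User-agent: *
# internal search results: weak, short-lived documents
Disallow: /search/
# endless sort-order duplicates of the same listing
Disallow: /*?sort=
```

For documents that should be crawled but kept out of the index, a <meta name="robots" content="noindex"> tag in the page head serves the same goal.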
Tasks related to semantics include:
- The semantic core in the correct format: every query must carry its usage frequency, with and without quotes, from Wordstat.yandex.ru.
- Grouping of queries.
- Regular updates of the core.
- Parsing all sources.
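A minimal sketch of how such a core might be stored and grouped. The record format and the first-word grouping heuristic are my assumptions; real grouping is usually done by SERP or semantic similarity:

```python
from dataclasses import dataclass

@dataclass
class CoreEntry:
    """One semantic-core record: query plus its two Wordstat frequencies."""
    query: str
    freq_broad: int  # frequency without quotes
    freq_exact: int  # frequency in quotes

def group_by_head(entries: list[CoreEntry]) -> dict[str, list[CoreEntry]]:
    """Naive grouping: bucket queries by their first word."""
    groups: dict[str, list[CoreEntry]] = {}
    for e in entries:
        groups.setdefault(e.query.split()[0], []).append(e)
    return groups

core = [
    CoreEntry("buy phone", 12000, 900),
    CoreEntry("buy phone case", 4000, 300),
    CoreEntry("repair phone", 2500, 150),
]
grouped = group_by_head(core)
```

Keeping both frequencies makes it easy to spot "inflated" queries, where the broad frequency is large but the exact-match frequency is near zero.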
In terms of work on content and unique landing pages:
- The ever-present problem of duplicates.
- It is important to keep adding fresh news items and reviews.
- Titles must be written deliberately, and content kept unique.
- Distribute queries across titles + templates + low-frequency keywords.
- It is important to handle noindex/JS correctly.
Tasks in working on the site structure include:
- Work on pagination.
- Work on tag pages (pages created for queries but absent from the site structure).
- Work on internal linking. For Yandex this work is largely a thing of the past, since internal links are not counted for commercial queries in Moscow.
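The pagination work above is commonly expressed in markup. A sketch for page 2 of a hypothetical listing; rel="prev"/"next" was historically a hint to search engines that the pages form one sequence (Google has since retired it as an indexing signal), so treat it as one option rather than a guarantee:

```
<link rel="prev" href="https://example.com/catalog/phones/page-1.html">
<link rel="next" href="https://example.com/catalog/phones/page-3.html">
```

The key point is consistency: paginated URLs should follow one predictable pattern so that crawl rules like the page-N regex earlier in the article can target them.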
Measurement tasks can be reduced to the following:
- Grouping queries into visibility, traffic, and conversion groups.
- Finding pessimized (ranking-demoted) pages + automatic criteria for text quality.
- Checking queries for commercial intent and auditing the link profile.
- Snippets, microformats, Yandex Islands.
- Competitor analysis.
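The query-grouping step above can be sketched as a simple bucketing function. The thresholds and group names are my assumptions; real cutoffs would come from the site's own analytics:

```python
def bucket(visibility: float, traffic: int, conversions: int) -> str:
    """Assign a query to a working group based on its metrics."""
    if conversions > 0:
        return "converting"        # protect and expand these pages
    if traffic > 0:
        return "traffic-no-conv"   # review landing-page relevance
    if visibility > 0.0:
        return "visible-no-click"  # review snippets and titles
    return "invisible"             # candidates for content work

print(bucket(0.8, 120, 3))  # converting
print(bucket(0.0, 0, 0))    # invisible
```

Each bucket maps to a different kind of work, which is why grouping comes before any page-level changes.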
Commercial factors (Yandex section)
Factors whose influence on position in Search is recognized as the strongest:
- terms of service
- an online order form
- social network presence
- business hours
- a description of payment methods
- a presence on Yandex.Market
- a listing in Yandex.Catalog
Factors that simply have influence, i.e. are needed for a good position:
- a company address
- customer reviews
- a discounts and promotions page
- company details
- an "about" page
- return and exchange terms
- product photos in high resolution
Commercial factors that probably do not affect a site's position in Search:
- a local phone number
- item descriptions
- an online assistant
- delivery terms for the main region
- delivery terms for other regions
- site search
- a feedback form