Re: [BULK] Re: [WebDNA] set HTTP-Status Code from webdna

This WebDNA talk-list message, posted by Tom Duke in 2016, keeps its original formatting.
Hi all,

Thought I would add my approach to 'pretty' URLs using mod_rewrite rather than routing through an error document.

Basically, everything except images and the folders/files that I specify is routed to 'parser.tmpl'. That template then parses the URL, and you can then search databases, include files, etc.
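
As a rough, untested illustration (not the actual parser.tmpl), a minimal parser template might look something like this, assuming the rewrite rule below passes the original URI in a 'requestedurl' parameter and that pages are listed in a hypothetical pages.db with 'path' and 'template' fields:

[!] parser.tmpl - hypothetical sketch only [/!]
[search db=pages.db&eqpathdatarq=[requestedurl]]
[showif [numfound]>0]
[founditems][include file=[template]][/founditems]
[/showif]
[hideif [numfound]>0]
[include file=/404.tmpl]
[/hideif]
[/search]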

Here's a sample htaccess file with all the mod_rewrite stuff and some other things that people might find useful.

- Tom


PS. This is a great resource on what can be done using the htaccess file:
https://github.com/h5bp/html5-boilerplate/blob/master/dist/.htaccess

# Better website experience for IE
Header set X-UA-Compatible "IE=edge"
<FilesMatch "\.(appcache|crx|css|eot|gif|htc|ico|jpe?g|js|m4a|m4v|manifest|mp4|oex|oga|ogg|ogv|otf|pdf|png|safariextz|svgz?|ttf|vcf|webapp|webm|webp|woff|xml|xpi)$">
Header unset X-UA-Compatible
</FilesMatch>

DirectoryIndex index.html index.tmpl

# Proper MIME types for all files
AddType application/javascript                 js
AddType application/json                       json
AddType video/mp4                              mp4 m4v f4v f4p
AddType video/x-flv                            flv
AddType application/font-woff                  woff
AddType application/vnd.ms-fontobject          eot
AddType application/x-font-ttf                 ttc ttf
AddType font/opentype                          otf
AddType image/svg+xml                          svg svgz
AddEncoding gzip                               svgz
AddType application/x-shockwave-flash          swf
AddType application/xml                        atom rdf rss xml
AddType image/x-icon                           ico
AddType text/vtt                               vtt
AddType text/x-component                       htc
AddType text/x-vcard                           vcf
AddType text/csv                               csv

# UTF-8 encoding
AddDefaultCharset utf-8
AddCharset utf-8 .atom .css .js .json .rss .vtt .webapp .xml

# Security - Block access to directories without a default document
Options -Indexes

# Block access to backup and source files
<FilesMatch "(^#.*#|\.(bak|config|dist|fla|inc|ini|log|psd|sh|sql|sw[op])|~)$">
Order allow,deny
Deny from all
Satisfy All
</FilesMatch>

# Rewrite engine
RewriteEngine On

# Redirect to Main 'www' Domain
RewriteCond %{HTTP_HOST} ^yourdomain\.com [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,NC,L]

# Exclude these directories and files from rewrite
RewriteRule ^(admin|otherdirectories|parser\.tmpl|robots\.txt)($|/) - [L]

# Exclude images from rewrite
RewriteCond %{REQUEST_URI} !\.(gif|jp?g|png|css|ico) [NC]

# Route everything else through parser.tmpl
RewriteRule . /parser.tmpl?requestedurl=%{REQUEST_URI}&query=%{QUERY_STRING}&serverport=%{SERVER_PORT} [L]

==============================================
Digital Revolutionaries
1st Floor, Castleriver House
14-15 Parliament Street
Temple Bar, Dublin 2
Ireland
----------------------------------------------
[t]: + 353 1 4403907
[e]: tom@revolutionaries.ie
[w]: http://www.revolutionaries.ie/
==============================================

On 24 March 2016 at 17:50, <christophe.billiottet@webdna.us> wrote:

What about using [referrer] to allow your customers to navigate your website but disallow bookmarking and outside links? You could also use [session] to limit the navigation to X minutes or Y pages, even for bots, then "kick" the visitor out.


- chris
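
A bare-bones, untested sketch of the [referrer] idea above (the domain and file names are placeholders, not anyone's production code):

[!] hypothetical referrer gate - show content only to visitors navigating from our own pages [/!]
[showif [referrer]^yourdomain.com]
[include file=/real-content.tmpl]
[/showif]
[hideif [referrer]^yourdomain.com]
[!] empty referrer (bookmark) or an outside link ends up here [/!]
[include file=/no-outside-links.tmpl]
[/hideif]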




> On Mar 24, 2016, at 20:30, Brian Burton <brian@burtons.com> wrote:
>
> Backstory: the site in question is a replacement parts business and has hundreds of thousands of pages of cross-reference material, all stored in databases and generated as needed. Competitors and dealers that carry competitors' brand parts seem to think that copying our cross reference is easier than creating their own (it would be), so code was written to block this.
>
> YES, I KNOW that if they are determined, they will find a way around my blockades (I've seen quite a few variations on this: Tor, AWS, other VPNs…)
>
> Solution: looking at the stats for the average use of the website, we found that 95% of the site traffic visited 14 pages or less. So…
> I have a visitors.db. The system logs all page requests tracked by IP address, and after a set amount (more than 14 pages, but still a pretty low number) starts showing visitors a nice Page Limit Exceeded page instead of what they were crawling through. After an unreasonable number of pages I just 404 them out to save server time and bandwidth. The count resets at midnight, because I'm far too lazy to track 24 hours since the first or last page request (per IP). In some cases, when I'm feeling particularly mischievous, once a bot is detected I start feeding them fake info :D
>
> Here's the Visitors.db header: (not sure if it will help, but it is what it is)
> VID  IPadd  ipperm  ipname  visitdate  pagecount  starttime  endtime  domain  firstpage  lastpage  browtype  lastsku  partner  linkin  page9  page8  page7  page6  page5  page4  page3  page2  page1
>
>
> All the code that does the tracking and counting and map/reduction to store stats and stuff is proprietary (sorry), but I'll see what (if anything) I can share a bit later, and try to write it up as a blog post or something.
>
> -Brian B. Burton
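
For anyone wanting to experiment with the same general idea, a stripped-down, untested sketch of the per-IP counting step might look roughly like this (field names follow the header above; the threshold and file names are placeholders, and this is not Brian's proprietary code):

[!] hypothetical per-IP page counter - run near the top of each tracked page [/!]
[text]hits=0[/text][text]pages=0[/text]
[search db=visitors.db&eqIPadddatarq=[ipaddress]&eqvisitdatedatarq=[date]]
[text]hits=[numfound][/text]
[founditems][text]pages=[pagecount][/text][/founditems]
[/search]
[showif [hits]=0]
[!] first request today from this IP - new row per day, so the count resets at midnight [/!]
[append db=visitors.db]IPadd=[ipaddress]&visitdate=[date]&pagecount=1[/append]
[/showif]
[showif [hits]>0]
[replace db=visitors.db&eqIPadddatarq=[ipaddress]&eqvisitdatedatarq=[date]]pagecount=[math][pages]+1[/math][/replace]
[/showif]
[showif [pages]>14]
[!] over the limit - show the Page Limit Exceeded page instead of the real content [/!]
[include file=/pagelimit.tmpl]
[/showif]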
>
>> On Mar 24, 2016, at 11:41 AM, Jym Duane <jym@purposemedia.com> wrote:
>>
>> curious how to determine... non google/bing/yahoo bots and others attempting to crawl/copy the entire site?
>>
>>
>>
>> On 3/24/2016 9:28 AM, Brian Burton wrote:
>>> Noah,
>>>
>>> Similar to you, and wanting to use pretty URLs, I built something similar, but did it a different way.
>>> _All_ page requests are caught by a url-rewrite rule and get sent to dispatch.tpl
>>> Dispatch.tpl has hundreds of rules that decide what page to show, and uses includes to do it.
>>> (this keeps everything in-house to WebDNA so I don't have to go mucking about in WebDNA here, and Apache there, and Linux somewhere else, etc…)
>>>
>>> Three special circumstances came up that needed special code to send out proper HTTP status codes:
>>>
>>> <!-- for page URLs that have permanently moved (WebDNA sends out a 302 temporarily-moved code on a redirect) -->
>>> [function name=301public]
>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>> [returnraw]HTTP/1.1 301 Moved Permanently[eol]Location: http://www.example.com[link][eol][eol][/returnraw]
>>> [/function]
>>>
>>> <!-- I send this to non google/bing/yahoo bots and others attempting to crawl/copy the entire site -->
>>> [function name=404hard]
>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol]<html>[eol]<body>[eol]<h1>404 Not Found</h1>[eol]The page that you have requested ([thisurl]) could not be found.[eol]</body>[eol]</html>[/returnraw]
>>> [/function]
>>>
>>> <!-- and finally a pretty 404 page for humans -->
>>> [function name=404soft]
>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol][include file=/404pretty.tpl][/returnraw]
>>> [/function]
>>>
>>> Hope this helps
>>> -Brian B. Burton
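
If memory serves, parameters passed when calling a [function] become local tags inside it, so once defined these should be callable from dispatch.tpl along the following lines (the paths and the bot flag are made up for illustration):

[!] hypothetical calls from dispatch.tpl [/!]
[showif [thisurl]^/old-catalog/]
[301public link=/new-catalog/index.html]
[/showif]
[showif [suspectedbot]=T]
[404hard]
[/showif]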
>

-----------------------------------------------------------
This message is sent to you because you are subscribed to
the mailing list <talk@webdna.us>.
To unsubscribe, E-mail to: <talk-leave@webdna.us>

