Understanding the TCP Protocol and Its Source Code in Depth
TCP is a connection-oriented, reliable, byte-stream transport protocol that sits at the transport layer of the TCP/IP model.
Connection-oriented: unlike UDP, TCP requires the two endpoints to establish a connection before any data can be transferred.
Reliable: once the connection is up, TCP guarantees that data is not silently lost; lost segments are detected and retransmitted.
Byte stream: whatever the application sends, TCP carries it as a sequence of bytes rather than as discrete packets, so message boundaries are not preserved.
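One practical consequence of the byte-stream model: a single recv() can return fewer bytes than the peer wrote with a single send(), so receivers must loop until they have a full message. A minimal sketch of that loop (the helper name read_exactly is ours):

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read exactly `len` bytes from a connected TCP socket.
 * TCP preserves byte order but not message boundaries, so one send()
 * on the peer may arrive here as several shorter recv() results. */
static int read_exactly(int fd, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0);
        if (n == 0)
            return -1;            /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted by a signal, retry */
            return -1;            /* real error */
        }
        got += (size_t)n;
    }
    return 0;
}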
1. Establishing a Connection: the Three-Way Handshake
TCP is a connection-oriented protocol: before either side can send data to the other, a connection must first be established between them. The three-way handshake (Three-Way Handshake) is exactly this connection setup: establishing a TCP connection requires the client and the server to exchange a total of three packets.
In socket programming, the process is triggered by the client calling connect(). The flow is as follows:
(1) First handshake: the client sets the SYN flag to 1, picks a random value seq=J, and sends the packet to the server; the client enters the SYN_SENT state and waits for the server's confirmation.
(2) Second handshake: when the packet arrives, the SYN=1 flag tells the server that the client is requesting a connection. The server sets both SYN and ACK to 1, sets ack=J+1, picks its own random value seq=K, and sends the packet back to acknowledge the request; the server enters the SYN_RCVD state.
(3) Third handshake: the client receives the reply and checks that ack equals J+1 and that ACK is 1. If so, it sets ACK to 1 and ack=K+1 and sends the packet to the server. The server checks that ack equals K+1 and ACK is 1; if correct, the connection is established, both sides enter the ESTABLISHED state, and the three-way handshake is complete. From then on, client and server can exchange data.
In short:
1. To establish a connection, the client sends a SYN segment (seq=i) to the server, enters the SYN_SENT state, and waits for the server's confirmation;
2. The server receives the SYN and must acknowledge it (ack=i+1), while also sending a SYN of its own (seq=k), i.e. a SYN+ACK segment; the server enters the SYN_RCVD state;
3. The client receives the SYN+ACK and replies with an ACK (ack=k+1); once this segment is sent, both client and server enter the ESTABLISHED state, the three-way handshake is complete, and data transfer can begin. The client side of this exchange is sketched below.
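A minimal sketch of the active open in socket programming, assuming a server at 127.0.0.1:8000 (address and port are examples only). On a blocking socket, connect() returns only after all three handshakes have completed:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Create a TCP (stream) socket; the kernel object starts out CLOSED. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(8000);                     /* example port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr); /* example address */

    /* connect() sends the SYN (SYN_SENT) and, on a blocking socket,
     * returns only once the handshake has finished (ESTABLISHED). */
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}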
The states of the TCP state-transition diagram:
CLOSED: the starting point, entered on a timeout or when the connection is closed. It is not a real state but the notional start and end point of the state diagram.
LISTEN: the state in which the server waits for connections.
The server enters this state after the socket, bind and listen calls and starts listening for connection requests from clients.
This is called a passive open by the application (waiting for a client's connection request).
SYN_SENT: the stage of the first handshake, in which the client initiates the connection.
The client calls connect, sends a SYN to the server, then enters SYN_SENT and waits for the server's confirmation (the second segment of the three-way handshake).
If the server cannot be connected to, the client goes straight back to the CLOSED state.
SYN_RCVD: the stage of the second handshake. The server has received the client's SYN, moves from LISTEN into SYN_RCVD, and replies with an ACK plus a SYN of its own, i.e. a SYN+ACK, to the client.
The state diagram also covers a simultaneous open: if the client receives the server's SYN while its own SYN is still in flight (both sides initiating at once), the client moves from SYN_SENT to SYN_RCVD.
ESTABLISHED: the stage of the third handshake. After receiving the server's SYN+ACK, the client sends the final ACK and enters ESTABLISHED, meaning the client side is ready; TCP, however, requires both ends to be ready before data can be transferred.
When the server receives that ACK, it moves from SYN_RCVD to ESTABLISHED, meaning the server side is ready for data transfer as well.
With both client and server in ESTABLISHED, data transfer can begin (the passive-open side is sketched below).
ESTABLISHED can therefore also be regarded as the data-transfer state.
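To complement the connect() sketch above, here is the passive-open side, again assuming example port 8000: socket/bind/listen put the socket into LISTEN, and accept() returns a new socket only for a connection whose handshake has reached ESTABLISHED:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8000);            /* example port */

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 128) < 0) {             /* socket is now in LISTEN */
        perror("bind/listen");
        return 1;
    }

    /* accept() dequeues a connection that has completed the three-way
     * handshake; the socket it returns is already ESTABLISHED. */
    int cfd = accept(lfd, NULL, NULL);
    if (cfd >= 0)
        close(cfd);
    close(lfd);
    return 0;
}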
2. Kernel Source Analysis
The client actively initiates the connection by calling connect:
/*
* Attempt to connect to a socket with the server address. The address
* is in user space so we verify it is OK and move it to kernel space.
*
* For 1003.1g we need to add clean support for a bind to AF_UNSPEC to
* break bindings
*
* NOTE: 1003.1g draft 6.3 is broken with respect to AX.25/NetROM and
* other SEQPACKET protocols that take time to connect() as it doesn't
* include the -EINPROGRESS status for such sockets.
*/
int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen)
{
struct socket *sock;
struct sockaddr_storage address;
int err, fput_needed;
sock = sockfd_lookup_light(fd, &err, &fput_needed);
if (!sock)
goto out;
//Copy the address from user space into kernel space
err = move_addr_to_kernel(uservaddr, addrlen, &address);
if (err < 0)
goto out_put;
err = security_socket_connect(sock, (struct sockaddr *)&address, addrlen);
if (err)
goto out_put;
err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen,
sock->file->f_flags);
out_put:
fput_light(sock->file, fput_needed);
out:
return err;
}
SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
int, addrlen)
{
return __sys_connect(fd, uservaddr, addrlen);
}
This function looks up the socket object for the given file descriptor, copies the address information from user space into kernel space, and then calls the connect operation of the specific socket type.
For stream sockets, the connect operation is inet_stream_connect:
int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
int addr_len, int flags)
{
int err;
lock_sock(sock->sk);
err = __inet_stream_connect(sock, uaddr, addr_len, flags);
release_sock(sock->sk);
return err;
}
/*
* Connect to a remote host. There is regrettably still a little
* TCP 'magic' in here.
*/
//1. Check the socket address length and the address family.
//2. Check the socket state, which must be SS_UNCONNECTED or SS_CONNECTING.
//3. Call tcp_v4_connect() to send the SYN.
//4. Wait for the rest of the handshake to complete:
int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
int addr_len, int flags)
{
struct sock *sk = sock->sk;
int err;
long timeo;
if (addr_len < sizeof(uaddr->sa_family))
return -EINVAL;
if (uaddr->sa_family == AF_UNSPEC) {
err = sk->sk_prot->disconnect(sk, flags);
sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
goto out;
}
switch (sock->state) {
default:
err = -EINVAL;
goto out;
case SS_CONNECTED:
//already in the connected state
err = -EISCONN;
goto out;
case SS_CONNECTING:
//connection attempt already in progress
err = -EALREADY;
/* Fall out of switch with err, set for this state */
break;
case SS_UNCONNECTED:
err = -EISCONN;
if (sk->sk_state != TCP_CLOSE)
goto out;
//For stream sockets, sock->ops is inet_stream_ops --> inet_stream_connect --> tcp_prot --> tcp_v4_connect
//For datagram sockets, sock->ops is inet_dgram_ops --> inet_dgram_connect --> udp_prot --> ip4_datagram_connect
err = sk->sk_prot->connect(sk, uaddr, addr_len);
if (err < 0)
goto out;
sock->state = SS_CONNECTING;
/* Just entered SS_CONNECTING state; the only
* difference is that return value in non-blocking
* case is EINPROGRESS, rather than EALREADY.
*/
err = -EINPROGRESS;
break;
}
//Get the blocking timeout timeo; for a non-blocking socket, timeo is 0.
//connect() times out after sk->sk_sndtimeo, which sock_init_data() initializes to MAX_SCHEDULE_TIMEOUT (wait forever); it can be changed via the SO_SNDTIMEO socket option.
timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
int writebias = (sk->sk_protocol == IPPROTO_TCP) &&
tcp_sk(sk)->fastopen_req &&
tcp_sk(sk)->fastopen_req->data ? 1 : 0;
/* Error code is set above */
if (!timeo || !inet_wait_for_connect(sk, timeo, writebias))
goto out;
err = sock_intr_errno(timeo);
//The process caught a signal; if err is -ERESTARTSYS, the C library will call connect() again.
if (signal_pending(current))
goto out;
}
/* Connection was closed by RST, timeout, ICMP error
* or another process disconnected us.
*/
if (sk->sk_state == TCP_CLOSE)
goto sock_error;
/* sk->sk_err may be not zero now, if RECVERR was ordered by user
* and error was received after socket entered established state.
* Hence, it is handled normally after connect() return successfully.
*/
//Mark the socket as connected
sock->state = SS_CONNECTED;
//Clear the error code
err = 0;
out:
return err;
sock_error:
err = sock_error(sk) ? : -ECONNABORTED;
sock->state = SS_UNCONNECTED;
//For TCP, sk_prot is tcp_prot and disconnect is tcp_disconnect()
if (sk->sk_prot->disconnect(sk, flags))
//if the disconnect fails
sock->state = SS_DISCONNECTING;
goto out;
}
This function first checks the socket address length and the address family, then checks that the socket state is SS_UNCONNECTED or SS_CONNECTING, and calls the connect function of the underlying protocol; for a stream socket that protocol is TCP, so tcp_v4_connect() is invoked. A blocking call then waits for the rest of the handshake to complete, while a non-blocking call returns -EINPROGRESS immediately.
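In user space, the -EINPROGRESS path is the familiar non-blocking connect() pattern: the call returns at once, the application waits for the socket to become writable, then reads SO_ERROR to learn whether the handshake succeeded. A hedged sketch (the helper name connect_nonblock is ours):

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>

/* Non-blocking connect: returns 0 once the handshake has finished,
 * -1 on failure or timeout. */
static int connect_nonblock(int fd, const struct sockaddr *sa,
                            socklen_t len, int timeout_ms)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(fd, sa, len) == 0)
        return 0;                  /* connected immediately (rare) */
    if (errno != EINPROGRESS)
        return -1;                 /* immediate failure */

    /* The SYN is in flight; the socket becomes writable when the
     * handshake completes or fails. */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;                 /* timeout or poll error */

    int err = 0;
    socklen_t elen = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen) < 0 || err != 0)
        return -1;                 /* handshake failed, e.g. RST received */
    return 0;
}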
The source of tcp_v4_connect:
/* This will initiate an outgoing connection. */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
struct inet_sock *inet = inet_sk(sk);
struct tcp_sock *tp = tcp_sk(sk);
__be16 orig_sport, orig_dport;
__be32 daddr, nexthop;
struct flowi4 *fl4;
struct rtable *rt;
int err;
struct ip_options_rcu *inet_opt;
if (addr_len < sizeof(struct sockaddr_in))
return -EINVAL;
if (usin->sin_family != AF_INET)
return -EAFNOSUPPORT;
nexthop = daddr = usin->sin_addr.s_addr;
inet_opt = rcu_dereference_protected(inet->inet_opt,
lockdep_sock_is_held(sk));
//Tentatively set both the next hop and the destination to the address supplied by the user.
if (inet_opt && inet_opt->opt.srr) {
if (!daddr)
return -EINVAL;
nexthop = inet_opt->opt.faddr;
}
//source port
orig_sport = inet->inet_sport;
//destination port
orig_dport = usin->sin_port;
fl4 = &inet->cork.fl.u.ip4;
//If source routing is in use, choose a suitable next-hop address.
rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
IPPROTO_TCP,
orig_sport, orig_dport, sk);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
if (err == -ENETUNREACH)
IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
return err;
}
if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
ip_rt_put(rt);
return -ENETUNREACH;
}
//The route lookup above also validates the returned route type: TCP may not use multicast or broadcast routes.
if (!inet_opt || !inet_opt->opt.srr)
daddr = fl4->daddr;
//Update the destination address with the value returned by the route lookup.
if (!inet->inet_saddr)
inet->inet_saddr = fl4->saddr;
sk_rcv_saddr_set(sk, inet->inet_saddr);
//If the source address and local receive address are not yet set, use the values from the route.
if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
/* Reset inherited state */
tp->rx_opt.ts_recent = 0;
tp->rx_opt.ts_recent_stamp = 0;
if (likely(!tp->repair))
tp->write_seq = 0;
}
if (tcp_death_row.sysctl_tw_recycle &&
!tp->rx_opt.ts_recent_stamp && fl4->daddr == daddr)
tcp_fetch_timewait_stamp(sk, &rt->dst);
//Save the destination address and port
inet->inet_dport = usin->sin_port;
sk_daddr_set(sk, daddr);
inet_csk(sk)->icsk_ext_hdr_len = 0;
if (inet_opt)
inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
//Set the default MSS clamp, 536
tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;
/* Socket identity is still unknown (sport may be zero).
* However we set state to SYN-SENT and not releasing socket
* lock select source port, enter ourselves into the hash tables and
* complete initialization after this.
*/
//Set the socket state to TCP_SYN_SENT.
tcp_set_state(sk, TCP_SYN_SENT);
err = inet_hash_connect(&tcp_death_row, sk);
if (err)
goto failure;
sk_set_txhash(sk);
//Dynamically pick a local port and insert the socket into the hash tables, much like bind(2) does.
rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
inet->inet_sport, inet->inet_dport, sk);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
rt = NULL;
goto failure;
}
/* OK, now commit destination to socket. */
sk->sk_gso_type = SKB_GSO_TCPV4;
sk_setup_caps(sk, &rt->dst);
if (!tp->write_seq && likely(!tp->repair))
tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
inet->inet_daddr,
inet->inet_sport,
usin->sin_port);
inet->inet_id = tp->write_seq ^ jiffies;
//Build a complete SYN segment from the information in sk and send it out.
err = tcp_connect(sk);
rt = NULL;
if (err)
goto failure;
return 0;
failure:
/*
* This unhashes the socket and releases the local port,
* if necessary.
*/
tcp_set_state(sk, TCP_CLOSE);
ip_rt_put(rt);
sk->sk_route_caps = 0;
inet->inet_dport = 0;
return err;
}
This function performs the route lookup to obtain the next-hop address and records it in the socket object, sets the socket state to TCP_SYN_SENT, chooses a random initial sequence number if one has not been set, and finally calls tcp_connect to build and send the SYN segment.
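The random initial value comes from secure_tcp_sequence_number(), which follows the RFC 6528 scheme: a keyed hash of the connection 4-tuple plus a steadily ticking counter, so initial sequence numbers are hard to predict off-path yet still monotonic per connection. The following is a conceptual sketch only; the toy mixing function stands in for the kernel's real cryptographic hash:

#include <stdint.h>

/* Illustrative stand-in for the kernel's keyed hash (the real code
 * uses a cryptographic hash with a boot-time secret key). */
static uint32_t toy_hash(uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport, uint64_t key)
{
    uint64_t x = key ^ saddr;
    x = x * 0x9e3779b97f4a7c15ULL ^ daddr;
    x = x * 0x9e3779b97f4a7c15ULL ^ (((uint32_t)sport << 16) | dport);
    return (uint32_t)(x >> 32);
}

/* RFC 6528: ISN = M + F(4-tuple, secret), where M is a clock-driven
 * counter (advancing roughly every 64 ns in recent Linux kernels). */
static uint32_t isn_sketch(uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport,
                           uint64_t secret, uint64_t ns_since_boot)
{
    return (uint32_t)(ns_since_boot / 64) +
           toy_hash(saddr, daddr, sport, dport, secret);
}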
The source of tcp_connect:
/* Build a SYN and send it off. */
int tcp_connect(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *buff;
int err;
//Initialize the connection-related members of the transmission control block
tcp_connect_init(sk);
if (unlikely(tp->repair)) {
tcp_finish_connect(sk, NULL);
return 0;
}
//Allocate an sk_buff for the SYN segment and initialize it
buff = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, true);
if (unlikely(!buff))
return -ENOBUFS;
//Build the SYN segment.
//write_seq was already initialized to a random value in tcp_v4_connect.
tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN);
tp->retrans_stamp = tcp_time_stamp;
//Add the segment to the send queue
tcp_connect_queue_skb(sk, buff);
//Explicit Congestion Notification --->
//Routers mark congestion in two bits of the IP header as segments pass through, so when a segment
//arrives, the receiver knows whether it experienced congestion somewhere along the path.
//It is the sender, however, not the receiver, that needs to react to congestion, so the receiver
//reports it in its next ACK, and the sender responds by shrinking its congestion window.
tcp_ecn_send_syn(sk, buff);
/* Send off SYN; include data in Fast Open. */
err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
//Build the TCP and IP headers and transmit
tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
if (err == -ECONNREFUSED)
return err;
/* We change tp->snd_nxt after the tcp_transmit_skb() call
* in order to make this packet get counted in tcpOutSegs.
*/
tp->snd_nxt = tp->write_seq;
tp->pushed_seq = tp->write_seq;
TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
/* Timer for repeating the SYN until an answer. */
//Start the retransmission timer
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
return 0;
}
This function:
1. initializes the connection-related fields of the socket;
2. allocates an sk_buff;
3. initializes the sk_buff as a SYN segment, essentially by filling in tcp_skb_cb, which is used later when the TCP header is built;
4. calls tcp_connect_queue_skb() to append the sk_buff to the send queue sk->sk_write_queue;
5. calls tcp_transmit_skb() to build the TCP header and hand the segment to the network layer;
6. starts the retransmission timer.
tcp_transmit_skb() works mainly by moving the sk_buff's data pointer to make room for the header, filling in the TCP header, and then passing the segment down to the network layer to be sent.
With that, the first of the three handshakes is done from the client's side: the segment travels to the server, and once the server has processed it the first handshake is complete, with the client socket in the TCP_SYN_SENT state.
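For reference, this is what the finished SYN looks like expressed with the user-space struct tcphdr: the sequence number carries the chosen ISN, only the SYN flag is set, and there is no payload. A hedged illustration, not the kernel's sk_buff-based code path (tcp_init_nondata_skb()/tcp_transmit_skb()):

#include <arpa/inet.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <string.h>

/* Fill a TCP header the way the first handshake segment looks on the
 * wire: seq = chosen ISN, SYN set, no ACK, no data. */
static void fill_syn(struct tcphdr *th, uint16_t sport, uint16_t dport,
                     uint32_t isn, uint16_t window)
{
    memset(th, 0, sizeof(*th));
    th->source = htons(sport);
    th->dest = htons(dport);
    th->seq = htonl(isn);           /* initial sequence number */
    th->doff = sizeof(*th) / 4;     /* header length in 32-bit words */
    th->syn = 1;                    /* the only flag set on this segment */
    th->window = htons(window);
    /* The checksum over the pseudo-header is left to the caller or NIC. */
}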
On the receiving side, every incoming TCP segment enters the stack through tcp_v4_rcv(). Its source:
/*
* From tcp_input.c
*/
int tcp_v4_rcv(struct sk_buff *skb)
{
struct net *net = dev_net(skb->dev);
int sdif = inet_sdif(skb);
const struct iphdr *iph;
const struct tcphdr *th;
bool refcounted;
struct sock *sk;
int ret;
if (skb->pkt_type != PACKET_HOST)
goto discard_it;
/* Count it even if it's bad */
__TCP_INC_STATS(net, TCP_MIB_INSEGS);
if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
goto discard_it;
th = (const struct tcphdr *)skb->data;
if (unlikely(th->doff < sizeof(struct tcphdr) / 4))
goto bad_packet;
if (!pskb_may_pull(skb, th->doff * 4))
goto discard_it;
/* An explanation is required here, I think.
* Packet length and doff are validated by header prediction,
* provided case of th->doff==0 is eliminated.
* So, we defer the checks. */
if (skb_checksum_init(skb, IPPROTO_TCP, inet_compute_pseudo))
goto csum_error;
th = (const struct tcphdr *)skb->data;
iph = ip_hdr(skb);
lookup:
sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source, th->dest, sdif, &refcounted);
if (!sk)
goto no_tcp_socket;
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
if (sk->sk_state == TCP_NEW_SYN_RECV) {
struct request_sock *req = inet_reqsk(sk);
bool req_stolen = false;
struct sock *nsk;
sk = req->rsk_listener;
if (unlikely(tcp_v4_inbound_md5_hash(sk, skb))) {
sk_drops_add(sk, skb);
reqsk_put(req);
goto discard_it;
}
if (tcp_checksum_complete(skb)) {
reqsk_put(req);
goto csum_error;
}
if (unlikely(sk->sk_state != TCP_LISTEN)) {
inet_csk_reqsk_queue_drop_and_put(sk, req);
goto lookup;
}
/* We own a reference on the listener, increase it again
* as we might lose it too soon.
*/
sock_hold(sk);
refcounted = true;
nsk = NULL;
if (!tcp_filter(sk, skb)) {
th = (const struct tcphdr *)skb->data;
iph = ip_hdr(skb);
tcp_v4_fill_cb(skb, iph, th);
nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
}
if (!nsk) {
reqsk_put(req);
if (req_stolen) {
/* Another cpu got exclusive access to req
* and created a full blown socket.
* Try to feed this packet to this socket
* instead of discarding it.
*/
tcp_v4_restore_cb(skb);
sock_put(sk);
goto lookup;
}
goto discard_and_relse;
}
if (nsk == sk) {
reqsk_put(req);
tcp_v4_restore_cb(skb);
} else if (tcp_child_process(sk, nsk, skb)) {
tcp_v4_send_reset(nsk, skb);
goto discard_and_relse;
} else {
sock_put(sk);
return 0;
}
}
if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
__NET_INC_STATS(net, LINUX_MIB_TCPMINTTLDROP);
goto discard_and_relse;
}
if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_and_relse;
if (tcp_v4_inbound_md5_hash(sk, skb))
goto discard_and_relse;
nf_reset(skb);
if (tcp_filter(sk, skb))
goto discard_and_relse;
th = (const struct tcphdr *)skb->data;
iph = ip_hdr(skb);
tcp_v4_fill_cb(skb, iph, th);
skb->dev = NULL;
if (sk->sk_state == TCP_LISTEN) {
ret = tcp_v4_do_rcv(sk, skb);
goto put_and_return;
}
sk_incoming_cpu_update(sk);
bh_lock_sock_nested(sk);
tcp_segs_in(tcp_sk(sk), skb);
ret = 0;
if (!sock_owned_by_user(sk)) {
ret = tcp_v4_do_rcv(sk, skb);
} else if (tcp_add_backlog(sk, skb)) {
goto discard_and_relse;
}
bh_unlock_sock(sk);
put_and_return:
if (refcounted)
sock_put(sk);
return ret;
no_tcp_socket:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto discard_it;
tcp_v4_fill_cb(skb, iph, th);
if (tcp_checksum_complete(skb)) {
csum_error:
__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
tcp_v4_send_reset(NULL, skb);
}
discard_it:
/* Discard frame. */
kfree_skb(skb);
return 0;
discard_and_relse:
sk_drops_add(sk, skb);
if (refcounted)
sock_put(sk);
goto discard_it;
do_time_wait:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
inet_twsk_put(inet_twsk(sk));
goto discard_it;
}
tcp_v4_fill_cb(skb, iph, th);
if (tcp_checksum_complete(skb)) {
inet_twsk_put(inet_twsk(sk));
goto csum_error;
}
switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
case TCP_TW_SYN: {
struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
&tcp_hashinfo, skb,
__tcp_hdrlen(th),
iph->saddr, th->source,
iph->daddr, th->dest,
inet_iif(skb),
sdif);
if (sk2) {
inet_twsk_deschedule_put(inet_twsk(sk));
sk = sk2;
tcp_v4_restore_cb(skb);
refcounted = false;
goto process;
}
}
/* to ACK */
/* fall through */
case TCP_TW_ACK:
tcp_v4_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
tcp_v4_send_reset(sk, skb);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:;
}
goto discard_it;
}
The main job of this function is to use the TCP header fields to look up the socket object that should process the segment, and then act according to that socket's state.
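The lookup can be pictured as a two-stage demux, which is roughly what __inet_lookup_skb() does: first an exact 4-tuple match against established connections, then a fallback to a listener on the destination port. The tables and hash function below are illustrative placeholders, not kernel data structures:

#include <stddef.h>
#include <stdint.h>

struct conn {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    struct conn *next;
};

#define HSIZE 256
static struct conn *established[HSIZE]; /* fully identified connections */
static struct conn *listeners[HSIZE];   /* LISTEN sockets, keyed by port */

static unsigned hash4(uint32_t s, uint32_t d, uint16_t sp, uint16_t dp)
{
    return (s ^ d ^ (((unsigned)sp << 16) | dp)) % HSIZE;
}

static struct conn *lookup(uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport)
{
    /* 1. Exact match: an established (or half-open) connection. */
    for (struct conn *c = established[hash4(saddr, daddr, sport, dport)];
         c; c = c->next)
        if (c->saddr == saddr && c->daddr == daddr &&
            c->sport == sport && c->dport == dport)
            return c;
    /* 2. Wildcard match: a listener bound to the destination port. */
    for (struct conn *c = listeners[dport % HSIZE]; c; c = c->next)
        if (c->dport == dport)
            return c;
    return NULL;
}

In our scenario the second stage finds the listener: the socket is in the TCP_LISTEN state, so tcp_v4_rcv() calls tcp_v4_do_rcv() directly: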
/* The socket must have it's spinlock held when we get
* here, unless it is a TCP_LISTEN socket.
*
* We have a potential double-lock case here, so even when
* doing backlog processing we use the BH locking scheme.
* This is because we cannot sleep with the original spinlock
* held.
*/
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
struct sock *rsk;
if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
struct dst_entry *dst = sk->sk_rx_dst;
sock_rps_save_rxhash(sk, skb);
sk_mark_napi_id(sk, skb);
if (dst) {
if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
!dst->ops->check(dst, 0)) {
dst_release(dst);
sk->sk_rx_dst = NULL;
}
}
tcp_rcv_established(sk, skb);
return 0;
}
if (tcp_checksum_complete(skb))
goto csum_err;
if (sk->sk_state == TCP_LISTEN) {
struct sock *nsk = tcp_v4_cookie_check(sk, skb);
if (!nsk)
goto discard;
if (nsk != sk) {
if (tcp_child_process(sk, nsk, skb)) {
rsk = nsk;
goto reset;
}
return 0;
}
} else
sock_rps_save_rxhash(sk, skb);
if (tcp_rcv_state_process(sk, skb)) {
rsk = sk;
goto reset;
}
return 0;
reset:
tcp_v4_send_reset(rsk, skb);
discard:
kfree_skb(skb);
/* Be careful here. If this function gets more complicated and
* gcc suffers from register pressure on the x86, sk (in %ebx)
* might be destroyed here. This current version compiles correctly,
* but you have been warned.
*/
return 0;
csum_err:
TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
goto discard;
}
EXPORT_SYMBOL(tcp_v4_do_rcv);
The function first handles the established fast path, then checks whether the segment belongs to a half-open connection; for a listening socket this check is done via tcp_v4_cookie_check() (tcp_v4_hnd_req in older kernels), and processing then branches on the segment type. A SYN segment ends up in tcp_rcv_state_process(), which drives the next stage of the handshake: the client receives the server's SYN+ACK and replies with an ACK.
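Before reading the function itself, its handshake-related transitions can be boiled down to a toy state machine; sequence checking, timers and error paths are all omitted, and the names below are ours, not the kernel's:

/* A toy reduction of the handshake transitions in tcp_rcv_state_process(). */
enum tcp_state { T_LISTEN, T_SYN_SENT, T_SYN_RECV, T_ESTABLISHED };

struct seg { int syn, ack; };

static enum tcp_state step(enum tcp_state st, struct seg in,
                           struct seg *out /* reply to send, if any */)
{
    out->syn = out->ack = 0;
    switch (st) {
    case T_LISTEN:
        if (in.syn && !in.ack) {       /* first handshake received */
            out->syn = out->ack = 1;   /* reply with SYN+ACK */
            return T_SYN_RECV;
        }
        break;
    case T_SYN_SENT:
        if (in.syn && in.ack) {        /* second handshake received */
            out->ack = 1;              /* reply with ACK */
            return T_ESTABLISHED;
        }
        break;
    case T_SYN_RECV:
        if (in.ack && !in.syn)         /* third handshake received */
            return T_ESTABLISHED;
        break;
    default:
        break;
    }
    return st;
}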
With that picture in mind, here is tcp_rcv_state_process():
/*
 * This function implements the receiving procedure of RFC 793 for
 * all states except ESTABLISHED and TIME_WAIT.
* It's called from both tcp_v4_rcv and tcp_v6_rcv and should be
* address independent.
*/
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
const struct tcphdr *th = tcp_hdr(skb);
struct request_sock *req;
int queued = 0;
bool acceptable;
switch (sk->sk_state) {
case TCP_CLOSE:
goto discard;
case TCP_LISTEN:
if (th->ack)
return 1;
if (th->rst)
goto discard;
if (th->syn) {
if (th->fin)
goto discard;
/* It is possible that we process SYN packets from backlog,
 * so we need to make sure to disable BH and RCU right there.
 */
rcu_read_lock();
local_bh_disable();
acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0;
local_bh_enable();
rcu_read_unlock();
if (!acceptable)
return 1;
consume_skb(skb);
return 0;
}
goto discard;
case TCP_SYN_SENT:
tp->rx_opt.saw_tstamp = 0;
tcp_mstamp_refresh(tp);
queued = tcp_rcv_synsent_state_process(sk, skb, th);
if (queued >= 0)
return queued;
/* Do step6 onward by hand. */
tcp_urg(sk, skb, th);
__kfree_skb(skb);
tcp_data_snd_check(sk);
return 0;
}
tcp_mstamp_refresh(tp);
tp->rx_opt.saw_tstamp = 0;
req = tp->fastopen_rsk;
if (req) {
bool req_stolen;
WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
sk->sk_state != TCP_FIN_WAIT1);
if (!tcp_check_req(sk, skb, req, true, &req_stolen))
goto discard;
}
if (!th->ack && !th->rst && !th->syn)
goto discard;
if (!tcp_validate_incoming(sk, skb, th, 0))
return 0;
/* step 5: check the ACK field */
acceptable = tcp_ack(sk, skb, FLAG_SLOWPATH |
FLAG_UPDATE_TS_RECENT |
FLAG_NO_CHALLENGE_ACK) > 0;
if (!acceptable) {
if (sk->sk_state == TCP_SYN_RECV)
return 1; /* send one RST */
tcp_send_challenge_ack(sk, skb);
goto discard;
}
switch (sk->sk_state) {
case TCP_SYN_RECV:
tp->delivered++; /* SYN-ACK delivery isn't tracked in tcp_ack */
if (!tp->srtt_us)
tcp_synack_rtt_meas(sk, req);
/* Once we leave TCP_SYN_RECV, we no longer need req
* so release it.
*/
if (req) {
inet_csk(sk)->icsk_retransmits = 0;
reqsk_fastopen_remove(sk, req, false);
/* Re-arm the timer because data may have been sent out.
* This is similar to the regular data transmission case
* when new data has just been ack'ed.
*
* (TFO) - we could try to be more aggressive and
* retransmitting any data sooner based on when they
* are sent out.
*/
tcp_rearm_rto(sk);
} else {
tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tp->copied_seq = tp->rcv_nxt;
}
smp_mb();
tcp_set_state(sk, TCP_ESTABLISHED);
sk->sk_state_change(sk);
/* Note, that this wakeup is only for marginal crossed SYN case.
* Passively open sockets are not waked up, because
* sk->sk_sleep == NULL and sk->sk_socket == NULL.
*/
if (sk->sk_socket)
sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
if (tp->rx_opt.tstamp_ok)
tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
if (!inet_csk(sk)->icsk_ca_ops->cong_control)
tcp_update_pacing_rate(sk);
/* Prevent spurious tcp_cwnd_restart() on first data packet */
tp->lsndtime = tcp_jiffies32;
tcp_initialize_rcv_mss(sk);
tcp_fast_path_on(tp);
break;
case TCP_FIN_WAIT1: {
int tmo;
/* If we enter the TCP_FIN_WAIT1 state and we are a
* Fast Open socket and this is the first acceptable
* ACK we have received, this would have acknowledged
* our SYNACK so stop the SYNACK timer.
*/
if (req) {
/* We no longer need the request sock. */
reqsk_fastopen_remove(sk, req, false);
tcp_rearm_rto(sk);
}
if (tp->snd_una != tp->write_seq)
break;
tcp_set_state(sk, TCP_FIN_WAIT2);
sk->sk_shutdown |= SEND_SHUTDOWN;
sk_dst_confirm(sk);
if (!sock_flag(sk, SOCK_DEAD)) {
/* Wake up lingering close() */
sk->sk_state_change(sk);
break;
}
if (tp->linger2 < 0) {
tcp_done(sk);
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
return 1;
}
if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
/* Receive out of order FIN after close() */
if (tp->syn_fastopen && th->fin)
tcp_fastopen_active_disable(sk);
tcp_done(sk);
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
return 1;
}
tmo = tcp_fin_time(sk);
if (tmo > TCP_TIMEWAIT_LEN) {
inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
} else if (th->fin || sock_owned_by_user(sk)) {
/* Bad case. We could lose such FIN otherwise.
* It is not a big problem, but it looks confusing
* and not so rare event. We still can lose it now,
* if it spins in bh_lock_sock(), but it is really
* marginal case.
*/
inet_csk_reset_keepalive_timer(sk, tmo);
} else {
tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
goto discard;
}
break;
}
case TCP_CLOSING:
if (tp->snd_una == tp->write_seq) {
tcp_time_wait(sk, TCP_TIME_WAIT, 0);
goto discard;
}
break;
case TCP_LAST_ACK:
if (tp->snd_una == tp->write_seq) {
tcp_update_metrics(sk);
tcp_done(sk);
goto discard;
}
break;
}
/* step 6: check the URG bit */
tcp_urg(sk, skb, th);
/* step 7: process the segment text */
switch (sk->sk_state) {
case TCP_CLOSE_WAIT:
case TCP_CLOSING:
case TCP_LAST_ACK:
if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
break;
/* fall through */
case TCP_FIN_WAIT1:
case TCP_FIN_WAIT2:
/* RFC 793 says to queue data in these states,
* RFC 1122 says we MUST send a reset.
* BSD 4.4 also does reset.
*/
if (sk->sk_shutdown & RCV_SHUTDOWN) {
if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
tcp_reset(sk);
return 1;
}
}
/* Fall through */
case TCP_ESTABLISHED:
tcp_data_queue(sk, skb);
queued = 1;
break;
}
/* tcp_data could move socket to TIME-WAIT */
if (sk->sk_state != TCP_CLOSE) {
tcp_data_snd_check(sk);
tcp_ack_snd_check(sk);
}
if (!queued) {
discard:
tcp_drop(sk, skb);
}
return 0;
}
EXPORT_SYMBOL(tcp_rcv_state_process);