3.1 Data source and key variables
This study uses a unique survey on Rural–urban Migration in China (RUMiC). The RUMiC database is constructed by a team of researchers from Australia and China, with the goal of studying issues such as the effect of rural–urban migration on income mobility and poverty alleviation, the state of education and health of children in migrant families, and the assimilation of migrant workers into the city (Akgüç et al. 2014).
The first wave of the survey was conducted in 2008, and the data became available in 2009. Three representative samples of households were surveyed, including a sample of 8,000 rural households, a sample of 5,000 rural–urban migrant households, and a sample of 5,000 urban households. In this paper, our empirical analyses use information mainly from the migrant sample. Since the migrants all came from rural areas, 99.4 percent of them have a rural hukou, although they currently live in cities.
The migrants surveyed were randomly chosen from the fifteen cities that are the top rural–urban migration destinations in China (see Figure 1). Eight of these cities are in coastal regions (Shanghai, Nanjing, Wuxi, Hangzhou, Ningbo, Guangzhou, Shenzhen, and Dongguan); five are in central inland regions (Zhengzhou, Luoyang, Hefei, Bengbu, and Wuhan); and two are in the west (Chengdu and Chongqing). The sampling procedure was carefully designed to ensure that the migrants in the database constitute a representative sample of all migrants in the fifteen cities.6
The migrant survey was designed to collect information about every household member. It asked detailed questions that generate more than 700 variables. For basic information about a household member, we know the person’s age, gender, education level, current address, home address before migration, etc. For employment experience, we know whether the person is self-employed or a wage worker, his/her occupation and monthly income, how he/she found the current job, what his/her first job was, how he/she found the first job, etc. For the self-employed, we know why they chose self-employment, the amount and sources of money they borrowed for the initial investment, the number of workers they currently hire, etc. Particularly useful for our study, the survey also asked about the migrant’s social and family network. We know who the migrant’s most important social contacts are and whether they live in the same city, whether the migrant’s parents and siblings also live in the same city, how many people the migrant greeted during the past Spring Festival, etc.
In our regression analysis, the dependent variable is whether an individual is currently self-employed or not. Among all of the migrant household heads in the database, 19.6 percent are self-employed.7 These individuals include restaurant owners, convenience store owners, scrap metal collectors, street vendors, and providers of services such as shoe shining and bicycle or electronics repair. A large proportion of these self-employed migrants simply work alone; only a quarter of them (25.4 percent) also hire other people. Among those who hire other people, the average number of employees is 3.5.
The type of work self-employed migrants do seems to be rather informal. This raises the question of whether only the truly unemployable fall into this status. It turns out this is not the case. In response to a survey question about the migrant’s reason for choosing self-employment, the top three answers are: (1) it brings a higher income (answered by 38 percent of the self-employed migrants); (2) it gives more flexibility and freedom (29 percent); and (3) it allows one to be one’s own boss (19 percent). Only a small fraction (12 percent) report being self-employed because they cannot find wage work.
Consistent with their stated top reason, we find that self-employed migrants indeed earn more. The average monthly income is 1,447.7 Chinese yuan for wage workers, 2,331.1 yuan for the self-employed who work alone, and 3,534.7 yuan for the self-employed who hire other people. We regress monthly income on employment status, controlling for gender, age, marital status, years of schooling, number of children, years since the person first migrated out of the rural area, city fixed effects, and home province fixed effects. The results show that the self-employed with no employees earn 964.7 yuan more per month than wage workers, and those with employees earn an additional 973.5 yuan a month. Thus, for most migrants in our sample, self-employment status seems to be desirable.8
In our regression analysis of whether a migrant is self-employed, the key independent variable is the size of a person’s social-family network. To measure this size, we use the number of friends one greeted during the past Spring Festival, the number of relatives one greeted during the past Spring Festival, or the sum of these two numbers.
Spring Festival, the most important traditional holiday in China, starts on the first day and ends on the fifteenth day of the first month of the Chinese lunar calendar. Many traditional activities take place during the festival, and they vary widely across regions. But one tradition is followed throughout the country: during the festival, people greet family members, relatives, and friends, wishing them a happy, healthy, and wealthy new year. We therefore use the self-reported number of friends and relatives an individual greeted during the festival to measure the size of the migrant’s social-family network. Traditionally, greetings during the Spring Festival are mostly delivered through personal visits. In recent years, greetings by phone, post, or even email have also become common, especially among the younger generations. Therefore, the persons greeted (i.e., the social-family network measured this way) are not necessarily local. Indeed, about half of the people greeted are currently not living in urban areas, most of whom are perhaps friends and relatives in the migrants’ home villages.
This network size measure is behaviorally revealed and is therefore more relevant for our purpose than a simple count of kin and acquaintances. For example, a person may have a first cousin who is by definition one of his relatives. However, if they have a soured relationship and are not on speaking terms, or if they live far away from each other and have lost contact, then the cousin is in effect out of this person’s network. It is important to discount the cousin for our purpose because it is unlikely that the cousin will provide any help when this person needs assistance during self-employment. Our measure achieves this because if a relative were effectively outside a person’s network, this person would not have greeted him during the Spring Festival. Similarly, we believe that only a friend greeted is truly a friend, and our network size measure includes only such real friends.
3.2 Identification strategy and econometric specification
Despite its good features, this network size measure also has drawbacks. For example, if a person has already chosen self-employment, he may have incentives to greet more friends and relatives simply because he has used or will likely seek their assistance during self-employment. For this reason, a simple correlation between self-employment status and network size cannot be interpreted as a causal effect of network size on the choice of self-employment. It may be a result of reverse causation, which is interesting in itself but not what we intend to study here.
Another issue with the network size measure is the concern of measurement error. During the survey, a respondent has to recall how many friends and relatives he greeted. Due to imperfect memory or lack of effort to do an accurate count, a respondent tends to report a number that appears to be a best guess. As we can see in Figure 2, most surveyed individuals reported round numbers, numbers that are multiples of five or ten. There is no reason to believe, for example, that a person is so much more likely to have actually greeted twenty than nineteen friends or relatives. Thus the spiky distributions in Figure 2 are almost surely a result of rounding or misreporting. As is well known, classical measurement errors in the independent variable will bias the OLS coefficient toward zero. Therefore, even if a larger social-family network indeed increases the probability of self-employment, a simple OLS regression may fail to identify a statistically significant effect because of errors in the measurement of network size.
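To make the attenuation concrete, consider the textbook case with no control variables. This is a standard result under simplifying assumptions, not a derivation specific to the RUMiC data; the symbols $s^{*}$ (true network size), $u$ (reporting error), $\sigma^2_{s^{*}}$, and $\sigma^2_{u}$ are notation introduced here for illustration only. Suppose the reported size is $s = s^{*} + u$, where $u$ is uncorrelated with $s^{*}$ and with the unobserved determinants of self-employment, and let $\beta$ denote the true effect of network size. Then the OLS slope from regressing self-employment on reported size satisfies

$$ \operatorname{plim}\ {\hat{\beta}}_{OLS}=\beta \cdot \frac{\sigma_{s^{*}}^2}{\sigma_{s^{*}}^2+{\sigma}_u^2}. $$

The ratio lies strictly between zero and one whenever $\sigma^2_{u}>0$, so the estimate is pulled toward zero, and more so the noisier the reports. Rounding to multiples of five or ten need not satisfy the classical assumptions exactly, so the formula should be read as indicating the direction of the bias rather than its size.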
The standard technique to overcome these reverse-causation and measurement-error problems is to instrument for the independent variable, which is the approach we take here. That is, we use an instrumental variable that is correlated with network size but does not affect the choice of self-employment through any other channel that is not accounted for. The particular IV we use is the distance between a migrant’s home province and the province he moved to when he first left his village to work in the urban sector.
More specifically, we construct a distance variable using information about a migrant’s home address and the province he migrated to when he first left his village.9 Since this first migration typically occurred a few years ago (with a median of six years ago) and the RUMiC project focuses on the migrant’s current situation, the survey did not ask about the exact destination of the first migration at the sub-provincial level. So we can only construct a distance variable at the province level. For each migrant, we calculate the log railway distance between the capital of the home province and the capital of the first destination province.10 If the home province is the same as the first destination province, we set the log distance equal to zero.
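The construction described above can be sketched in a few lines. This is a minimal illustration, not the code used for the paper: the function name, the lookup-table layout, and the province labels are assumptions, and the distances are placeholders rather than real railway distances.

```python
import math

# Hypothetical railway distances (km) between provincial capitals.
# The values are placeholders for illustration, not real distances.
RAIL_KM = {
    frozenset({"Province A", "Province B"}): 500.0,
    frozenset({"Province A", "Province C"}): 1200.0,
}


def log_first_migration_distance(home, first_dest, rail_km=RAIL_KM):
    """Log railway distance between the capital of the home province and
    the capital of the first destination province.

    Returns 0.0 for a move within the home province, following the
    convention stated in the text.
    """
    if home == first_dest:
        return 0.0
    # frozenset keys make the lookup independent of direction.
    return math.log(rail_km[frozenset({home, first_dest})])
```

Keying the table on an unordered pair reflects the fact that the distance between two capitals does not depend on the direction of travel.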
We expect, and the data confirm, that the distance of the first migration is correlated with the number of friends and relatives greeted during the past Spring Festival. The reason is simple. For people who grew up in rural China, social and family networks are highly local because they usually interact with and marry people from the same or nearby villages. A person who migrated far away would have been disconnected from many individuals in his original network for a considerable period of time. This is true even if the migrant later moved to a city closer to his home village. Because of this disruption, he tended to lose contact with some friends and relatives in his network. In the meantime, because he moved far away from home, he tended to know few locals and thus had difficulty developing a new network.
Our key identifying assumption is that the distance of the first migration does not affect today’s choice of self-employment through any other channels that are not controlled for in our regressions. We cannot test this assumption but believe it is plausible given the specific context of rural–urban migration in China and the particular samples of migrants used for estimation.
In recent years, as rural–urban migration has become an increasingly prominent social phenomenon in China, many field studies have been conducted to document the life experiences of these migrants.11 We have therefore learned a great deal about the process of their migration decisions from both anecdotal and statistical evidence. The key fact to keep in mind is that a typical villager in China has no chance to travel to many places and has very limited information about how the urban economy is organized in different cities. It is clear that the migration is usually triggered by a need or an urge to improve one’s individual or family economic conditions. But the initial migration location is mostly an accidental choice not based on an informed calculation of feasibility and potential returns of different locations.
A migrant almost always chose the first city because he happened to know someone who was already there. It could be a relative, a neighbor, a friend, or simply an acquaintance who already migrated to that city and demonstrated that it might be feasible for this person to do the same thing (Zhao, 1999, 2003).12 Also, because the migration was not meant to be permanent, the first-timers tend to have a trial-and-error attitude: “Let me give it a shot and see what happens.” For this reason, when looking at a random sample of migrants, it seems reasonable to think of their first migration distance as random, especially after controlling for home province fixed effects. That is, given two first-time migrants from the same province, whether one went farther away than the other is likely to be exogenous, driven mostly by whether one happened to know someone who had migrated far away. Note that we do not need this distance to be completely random; we only need it to be exogenous to the choice of self-employment today.
The most serious threat to the credibility of our identification strategy is that the first migration destination and the type of the first job in the urban sector (whether self-employed or not) may be jointly determined. If this is true, it is problematic to think of the distance of first migration as exogenous to a migrant’s self-employment decision, especially for those who are still in their first jobs in cities today. To overcome this problem, in our empirical analysis below, we focus on the sample of migrants who did not start as self-employed and who are not in their first jobs today. In other words, we examine migrants who all moved to urban areas to work for an employer and who have all changed jobs since. Some of them changed from wage work to self-employment; others remained wage workers but moved to different employers. We then ask the following empirical question: Among the rural–urban migrants who started as wage workers and later changed their jobs, who is more likely to have chosen self-employment today? Because all the migrants in this sample started as wage workers in the urban sector, it is much more plausible to assume that their first migration destinations were not chosen for the purpose of self-employment. It is thus reasonable to exclude the distance of the first migration from the main equation that explains a migrant’s self-employment status today.
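The sample restriction just described amounts to a simple filter. The sketch below is only illustrative: the `Migrant` record and its field names are hypothetical and do not correspond to actual RUMiC variable names.

```python
from dataclasses import dataclass


@dataclass
class Migrant:
    # Hypothetical field names; the RUMiC variable names differ.
    first_job_self_employed: bool
    still_in_first_job: bool
    self_employed_now: bool


def in_estimation_sample(m: Migrant) -> bool:
    """Keep migrants who started as wage workers and have since changed jobs."""
    return (not m.first_job_self_employed) and (not m.still_in_first_job)


migrants = [
    Migrant(False, False, True),   # wage worker first, now self-employed: kept
    Migrant(False, False, False),  # wage worker first, new employer: kept
    Migrant(False, True, False),   # still in first job: dropped
    Migrant(True, False, True),    # started as self-employed: dropped
]
sample = [m for m in migrants if in_estimation_sample(m)]
```

Within the retained sample, the outcome still varies: some migrants are self-employed today and others are not, which is what the regressions exploit.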
Another threat to the credibility of our identification strategy is the possibility that the distance of first migration is correlated with some unobserved characteristics of the migrant that in turn are correlated with the migrant’s choice of self-employment. In that case, the distance is not a valid instrumental variable. The most plausible scenario is perhaps that more adventurous individuals are more likely to migrate far away from home and that those people are also more willing to take risks and therefore more likely to choose self-employment. As it turns out, we find that individuals who migrated far away the first time tend to have a smaller social-family network today and are less likely to be self-employed today. Therefore, this concern about unobserved attitudes toward risk actually works against our findings. In particular, if it is indeed true that less risk-averse individuals tend to migrate a longer distance and are more likely to choose self-employment, then the true effect of network size is even higher than what we find. That is, our IV estimate can be thought of as a lower bound of the true effect.
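The lower-bound argument can be seen from the standard expression for the probability limit of a simple IV estimator. The display below is a sketch that ignores the control variables (or treats them as already partialled out); $d$ denotes the log distance of the first migration, $s$ the network size, $\varepsilon$ the unobserved determinants of self-employment, and $\beta$ the true effect of network size:

$$ \operatorname{plim}\ {\hat{\beta}}_{IV}=\beta +\frac{\operatorname{Cov}\left(d,\varepsilon \right)}{\operatorname{Cov}\left(d,s\right)}. $$

Since longer first-migration distance goes with a smaller network, $\operatorname{Cov}(d,s)<0$. If risk-loving individuals both migrate farther and are more prone to self-employment, then $\operatorname{Cov}(d,\varepsilon)>0$. The second term is then negative, so the IV estimate understates $\beta$.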
Our main estimating equation is as follows:
$$ {y}_{ji}=\alpha +\beta {s}_{ji}+{X}_{ji}\gamma +H{P}_j+{\varepsilon}_{ji}, $$
(1)
where the outcome variable $y_{ji}$ is a dummy variable taking value 1 if migrant $i$ from province $j$ is self-employed; $s_{ji}$ is the key independent variable that measures the size of social-family network for this individual; $X_{ji}$ is a vector of control variables including the migrant’s age, gender, years of schooling, marital status, number of children, and years since the person first migrated out of rural area; $HP_j$ is a home-province fixed effect that captures the effect of all unobserved factors common to migrants from province $j$; and $\varepsilon_{ji}$ is the error term.
When using the IV strategy, we estimate two-stage least squares (2SLS) regressions with the following first-stage equation:
$$ {s}_{ji}=\kappa +\varphi {d}_{ji}+{X}_{ji}\lambda +H{P}_j+{\mu}_{ji}, $$
(2)
where $d_{ji}$ is the log-distance between the home and destination provinces when individual $i$ from province $j$ first migrated to a city. Predicted values of $s_{ji}$ from this first-stage regression are then used for estimating equation (1) in the second stage.
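The two-stage procedure in equations (1) and (2) can be sketched with plain NumPy on simulated data. Everything below is an assumption made for illustration: the data-generating process, the coefficient values, the single control variable, and the use of a continuous outcome in place of the self-employment dummy. It is not the authors' code and uses no RUMiC data.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(y, s, d, controls):
    """2SLS with one endogenous regressor s and one instrument d.

    `controls` must already contain the constant, the X variables, and
    the home-province dummies (one category dropped).
    """
    # First stage, equation (2): regress s on the instrument and controls.
    Z = np.column_stack([d, controls])
    s_hat = Z @ ols(Z, s)
    # Second stage, equation (1): replace s with its first-stage prediction.
    X2 = np.column_stack([s_hat, controls])
    return ols(X2, y)[0]  # coefficient on network size (beta)


# Simulated data; all numbers are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n = 20_000
province = rng.integers(0, 5, size=n)
dummies = (province[:, None] == np.arange(1, 5)).astype(float)  # drop province 0
age = rng.normal(size=n)
controls = np.column_stack([np.ones(n), age, dummies])

d = rng.normal(size=n)   # stand-in for log first-migration distance
v = rng.normal(size=n)   # unobserved factor that makes s endogenous
s = 2.0 - 0.8 * d + 0.3 * age + 0.5 * province + v + rng.normal(size=n)

beta_true = 0.05
y = 0.1 + beta_true * s + 0.02 * age - 0.1 * v + rng.normal(scale=0.1, size=n)

beta_ols = ols(np.column_stack([s, controls]), y)[0]
beta_iv = two_stage_least_squares(y, s, d, controls)
print(f"true beta: {beta_true}, OLS: {beta_ols:.3f}, 2SLS: {beta_iv:.3f}")
```

In this simulation the OLS coefficient is biased because `v` enters both `s` and `y`, whereas the 2SLS coefficient recovers a value close to `beta_true`. The manual two-step approach gives the correct point estimate, but standard errors computed from the second-stage residuals would be wrong; a dedicated IV routine should be used for inference.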